Java RegEx: Replace all xml characters with their entity number

I am trying to port a function I wrote in ActionScript to Java and I am having a bit of trouble. I have included the function below. I found this response to question #375420, but do I really need to write a separate class? Thanks.
public static function replaceXML(str:String):String {
return str.replace(/[\"'&<>]/g, function($0:String):String {
return StringUtil.substitute('&#{0};', $0.charCodeAt(0));
});
}
Input
<root><child id="foo">Bar</child></root>
Output
&#60;root&#62;&#60;child id=&#34;foo&#34;&#62;Bar&#60;/child&#62;&#60;/root&#62;
UPDATE
Here is my solution if anyone is wondering. Thanks Sri Harsha Chilakapati.
public static String replaceXML(final String inputStr) {
String outputStr = inputStr;
Matcher m = Pattern.compile("[&<>'\"]").matcher(outputStr);
String found = "";
while (m.find()) {
found = m.group();
outputStr = outputStr.replaceAll(found,
String.format("&#%d;", (int)found.charAt(0)));
}
return outputStr;
}
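One caveat with the loop above: it calls replaceAll on the whole string for every match, so once an & is found, the ampersands of entities inserted earlier appear to get escaped a second time (for "<a & b>" the first character would end up as &#38;#60;). Below is a minimal single-pass sketch that avoids this, assuming the same five characters should be escaped. The class name XmlEntities is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical utility class; only the method name replaceXML comes from the post.
final class XmlEntities {
    // The five characters the original function escapes.
    private static final Pattern SPECIAL = Pattern.compile("[&<>'\"]");

    private XmlEntities() {
        // no instances
    }

    static String replaceXML(String input) {
        Matcher m = SPECIAL.matcher(input);
        StringBuffer out = new StringBuffer(input.length());
        while (m.find()) {
            // Each match is replaced exactly once, in place. The replacement text
            // contains no '$' or '\', so it needs no quoting.
            m.appendReplacement(out, "&#" + (int) m.group().charAt(0) + ";");
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because the output is built from the original input in one pass, text that has already been inserted is never scanned again.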

You can use regex for that.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String myString = "<root><child id=\"foo\">Bar</child></root>";
Matcher m = Pattern.compile("[^\\p{L}\\p{N};\"+*/-]").matcher(myString);
while (m.find()) {
String found = m.group();
myString = myString.replaceAll(found, "&#" + (int)found.charAt(0) + ";");
}
System.out.println(myString);
It's working.
Output is
&#60;root&#62;&#60;child&#32;id&#61;"foo"&#62;Bar&#60;/child&#62;&#60;/root&#62;

Java is an object-oriented language, so code lives in classes. A common approach is to create a utility class, e.g. RegExUtil, and provide a static method that can be invoked from any other class. The utility class itself shouldn't be instantiated; you can enforce that with a private constructor.
public class RegExUtil {
private RegExUtil() {
// private constructor: prevents instantiation
}
public static String replaceXML(String input) {
// TODO: implement the replacement
throw new UnsupportedOperationException("not implemented yet");
}
}
You should look at Apache Commons first, because it may already provide a solution for your objective, or at least show you how utility classes are structured.

Related

Sonarqube: How to get the expression string when writing custom java rules?

The target class is:
class Example{
public void m(){
System.out.println("Hello" + 1);
}
}
I want to get the full source string of the MethodInvocation, System.out.println("Hello" + 1), for a regex check. How can I do that?
public class Rule extends BaseTreeVisitor implements JavaFileScanner {
@Override
public void visitMethodInvocation(MethodInvocationTree tree) {
//get the string of MethodInvocation
//some regex check
super.visitMethodInvocation(tree);
}
}
I have written code inspection rules using Eclipse JDT and IDEA PSI, whose expression tree nodes expose this text. I wonder why Sonar's tree only gives the first and last token instead.
Thanks!

An old question, but I have a solution.
This works for any sort of tree.
@Override
public void visitMethodInvocation(MethodInvocationTree tree) {
int firstLine = tree.firstToken().line();
int lastLine = tree.lastToken().line();
String rawText = getRelevantLines(firstLine, lastLine);
// do your thing here with rawText
}
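private String getRelevantLines(int startLine, int endLine) {
// context is the JavaFileScannerContext received in scanFile(...).
// Token lines are assumed to be 1-based, while subList takes a 0-based,
// end-exclusive range; hence startLine - 1. This also covers single-line expressions.
return String.join("\n", context.getFileLines().subList(startLine - 1, endLine));
}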
If you want to refine further, you can also use firstToken().column(), or perhaps use the method name in your regex.
If you want more lines or a bigger scope, just use the parent of that tree: tree.parent().
This will also handle cases where the expression/params/etc span multiple lines.
There might be a better way... but I don't know of any other way. May update if I figure out something better.

Replace string patterns recursively in java

I have a utility class that resolves variables in an input string, as shown in the example below. All variables are surrounded by { and }. If my string is Language is {lang} and version 2 is {version}. Home located at {java.home}, the output is Language is java and version 2 is 1.8. Home located at C:/java. If my string is Language is {lang} and version 2 is {version}. Home located at {{lang}.home}, the output is Language is java and version 2 is 1.8. Home located at {java.home}. I am looking for a way to resolve nested properties recursively, but I ran into several issues. Can any logic be added to the code so that inner properties are resolved dynamically?
import java.util.*;
import java.util.regex.*;
public class MyClass {
public static void main(String args[]) {
System.setProperty("lang" , "java");
System.setProperty("version" , "1.8");
System.setProperty("java.home" , "C:/java");
System.out.println(resolve("Language is {lang} and version 2 is {version}. Home located at {java.home}"));
System.out.println(resolve("Language is {lang} and version 2 is {version}. Home located at {{lang}.home}"));
}
public static String resolve(String input) {
List<String> tokens = matchers("[{]\\S+[}]", input);
String value;
for(String token : tokens) {
value = getProperty(token);
if (null != value) {
input = input.replace(token, value);
}
value = "";
}
return input;
}
private static String getProperty(String key) {
key = key.substring(1, key.length()-1);
return System.getProperty(key);
}
public static List<String> matchers(String regex, String text) {
List<String> matches = new ArrayList<String>();
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
matches.add(matcher.group());
}
return matches;
}
public static boolean contains(String regex, String text) {
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
return matcher.find();
}
}

You just have to make the pattern match only placeholders that contain no inner { or }, using [^{}]. No curly bracket inside means no nested value, so you can safely do the replacement.
First, we create a Pattern. We need to escape the outer { and }, and we add a capture group for later.
Pattern p = Pattern.compile("\\{([^{}]+)\\}");
Then we create a Matcher for the current value:
Matcher m = p.matcher(s);
Now, we just have to check if there is a match and loop on it.
while( m.find() ){
...
}
Inside the loop we need the captured key, so we read the first group and look up its value (let's assume it is always present):
String key = m.group(1);
String value = properties.get(key); //add some fail safe.
Using Matcher.replaceFirst, we replace only the first match, which is the one we just read the key from. With replaceAll, every match would be replaced with the same value.
s = m.replaceFirst(properties.get(key));
Now, since we have updated the String, we need to run the regex against it again:
m = p.matcher(s);
Here is a full example:
Map<String, String> properties = new HashMap<>();
properties.put("lang", "java");
properties.put("java.version", "1.8");
String s = "This is {{lang}.version}.";
Pattern p = Pattern.compile("\\{([^{}]+)\\}");
Matcher m = p.matcher(s);
while(m.find()){
String key = m.group(1);
s = m.replaceFirst(properties.get(key));
System.out.println(s);
m = p.matcher(s); //Reset the matcher
}
Output:
This is {java.version}.
This is 1.8.
This has one drawback: it requires a lot of Matcher initialisation, so it is most likely not optimal (but that is not the point here).
FYI: using Matcher.replaceFirst instead of String.replaceFirst avoids compiling a new Pattern on every call. Here is the String.replaceFirst code:
public String replaceFirst(String regex, String replacement) {
return Pattern.compile(regex).matcher(this).replaceFirst(replacement);
}
We already have a Matcher to do that, so use it.
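For the "fail safe" mentioned above, here is one possible sketch. It leaves unknown keys untouched instead of failing, and it splices the value in with substring, since Matcher.replaceFirst treats $ and \ in the replacement specially (a value such as C:\java would lose its backslash). It also reuses one Matcher through reset. The names NestedResolver and resolve are illustrative only, and cyclic definitions are not detected here.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is the one used in the answer above.
final class NestedResolver {
    private static final Pattern INNERMOST = Pattern.compile("\\{([^{}]+)\\}");

    private NestedResolver() {
    }

    static String resolve(String s, Map<String, String> properties) {
        Matcher m = INNERMOST.matcher(s);
        int from = 0;
        while (m.find(from)) {
            String value = properties.get(m.group(1));
            if (value == null) {
                // Unknown key: keep the placeholder and continue after it.
                from = m.end();
                continue;
            }
            // Splice the value in literally, without replacement-string escapes.
            s = s.substring(0, m.start()) + value + s.substring(m.end());
            // Reuse the matcher on the updated string and rescan from the start,
            // because the replacement may have completed an outer placeholder.
            m.reset(s);
            from = 0;
        }
        return s;
    }
}
```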

There are lots of ways you could achieve this.
You need some way to communicate to the caller either whether a replacement is necessary, or whether one was made.
A simple option:
public boolean hasPlaceholder(String s) {
// return true if s contains a {} placeholder, else false
}
Using this you can repeatedly replace until done:
while(hasPlaceholder(s)) {
s = replacePlaceholders(s);
}
This does scan through the string more times than is strictly necessary, but you shouldn't optimise prematurely.
A more sophisticated option is for the replacePlaceholders() method to report back whether it succeeded. For that you'll need a response class that wraps the result String and the wasReplaced() boolean:
ReplacementResult replacePlaceholders(String s) {
// process string into newString, counting placeholders replaced
return new ReplacementResult(count > 0, newString);
}
(Implementation of ReplacementResult left as an exercise)
Using this you can do:
ReplacementResult result = replacePlaceholders(s);
while(result.wasReplaced()) {
result = replacePlaceholders(result.string());
}
So, each time you call replacePlaceholders() it will either make at least one replacement, or it will report false having verified that there are no more replacements to make.
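As a rough idea of what the "exercise" could look like, here is a sketch of ReplacementResult together with a replacePlaceholders() that counts its replacements. The class PlaceholderReplacer, the method resolveFully and the choice of a Map as the value source are assumptions made for this example, and the placeholder pattern (innermost braces only) is borrowed from the other answer.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simple immutable holder for the outcome of one replacement pass.
final class ReplacementResult {
    private final boolean replaced;
    private final String string;

    ReplacementResult(boolean replaced, String string) {
        this.replaced = replaced;
        this.string = string;
    }

    boolean wasReplaced() {
        return replaced;
    }

    String string() {
        return string;
    }
}

// Hypothetical class that owns the placeholder values.
final class PlaceholderReplacer {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([^{}]+)\\}");
    private final Map<String, String> values;

    PlaceholderReplacer(Map<String, String> values) {
        this.values = values;
    }

    // One pass: replaces every innermost placeholder that has a known value.
    ReplacementResult replacePlaceholders(String s) {
        Matcher m = PLACEHOLDER.matcher(s);
        StringBuffer out = new StringBuffer();
        int count = 0;
        while (m.find()) {
            String value = values.get(m.group(1));
            if (value == null) {
                // Unknown key: copy the placeholder through unchanged.
                m.appendReplacement(out, Matcher.quoteReplacement(m.group()));
            } else {
                m.appendReplacement(out, Matcher.quoteReplacement(value));
                count++;
            }
        }
        m.appendTail(out);
        return new ReplacementResult(count > 0, out.toString());
    }

    // Repeats passes until a pass makes no replacement.
    String resolveFully(String s) {
        ReplacementResult result = replacePlaceholders(s);
        while (result.wasReplaced()) {
            result = replacePlaceholders(result.string());
        }
        return result.string();
    }
}
```

Only successful lookups are counted, so a string that contains nothing but unknown placeholders ends the loop instead of spinning forever. A value that refers to itself would still loop.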
You mention recursion in the question. This can of course be done, and it would mean avoiding scanning through the whole string each time -- as you can look at just the replacement fragment. This is untested Java-like pseudocode:
String replaceRecursively(String s) {
StringBuilder result = new StringBuilder();
while(Token token = takeTokenFrom(s)) {
if(token.isPlaceholder()) {
String rawReplacement = lookupReplacement(token);
String processedReplacement = replaceRecursively(rawReplacement);
result.append(processedReplacement);
} else {
result.append(token.text());
}
}
return result.toString();
}
For all of these solutions, you should beware of infinite loops or stack-blowing recursion. What if you replace "{foo}" with "{foo}"? (or worse, what if you replace "{foo}" with "{foo}{foo}"!?).
Of course the simplest way is to be in control of the configuration, and simply not trigger that problem. Detecting the problem programmatically is entirely possible, but complex enough that it would warrant another SO question if you want it.
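For what it's worth, here is one way the recursive idea could be made runnable, with a basic guard against self-referencing values. This is only a sketch under assumptions: values come from a Map, unknown keys are left in place, and a cycle is reported with an IllegalStateException. The names RecursiveResolver and matchingBrace are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Hypothetical resolver; placeholders look like {key} and may be nested.
final class RecursiveResolver {
    private final Map<String, String> values;

    RecursiveResolver(Map<String, String> values) {
        this.values = values;
    }

    String resolve(String s) {
        return resolve(s, new ArrayDeque<>());
    }

    private String resolve(String s, Deque<String> inProgress) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '{') {
                out.append(c);
                i++;
                continue;
            }
            int close = matchingBrace(s, i);
            if (close < 0) {
                // Unbalanced brace: copy the rest verbatim.
                out.append(s, i, s.length());
                break;
            }
            // Resolve placeholders nested inside the key first.
            String key = resolve(s.substring(i + 1, close), inProgress);
            if (inProgress.contains(key)) {
                throw new IllegalStateException("Cyclic placeholder: " + key);
            }
            String raw = values.get(key);
            if (raw == null) {
                out.append('{').append(key).append('}');
            } else {
                // Only the replacement fragment is processed again.
                inProgress.push(key);
                out.append(resolve(raw, inProgress));
                inProgress.pop();
            }
            i = close + 1;
        }
        return out.toString();
    }

    // Index of the '}' that closes the '{' at position open, or -1 if there is none.
    private static int matchingBrace(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            if (s.charAt(i) == '{') {
                depth++;
            } else if (s.charAt(i) == '}' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }
}
```

The stack of keys currently being expanded is what catches the "{foo}" replaced with "{foo}" case: the key shows up again while it is still in progress.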

Untangling Inherited Methods

Okay, so this is my first time implementing classes, and everything's going wrong. I'm implementing a different class, PhraseGenerator, and the inherited method which I wish to define here is getPhrase(). It needs to return theArcha. Instead of writing my code inside the method from the start, I chose to wrap its braces around my work afterwards, and now, no matter where I put it, a different error arises. Before dealing with any of these, I want to make sure I'm putting it in the right place. To my understanding, it would go within public....FromFile implements PhraseGenerator. Any thoughts on where it should go?
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class PhraseGeneratorFromFile implements PhraseGenerator {
private ParserHelperImpl parserHelper;
public String getPhrase() {
public PhraseGeneratorFromFile(String filename) {
// read file
StringBuilder fileContent = new StringBuilder();
BufferedReader br = new BufferedReader(new FileReader(filename));
try {
String line = br.readLine();
while (line != null) {
fileContent.append(line);
fileContent.append('\n');
line = br.readLine();
}
String everything = fileContent.toString();
} finally {
br.close();
}
parserHelper = new ParserHelperImpl();
List<String> phraseCollection = parserHelper.getPhrases(fileContent,"phrases:");
String archetype = parserHelper.getRandomElement(phraseCollection);
boolean flagga = true;
while(flagga = true){
Pattern ptrn = Pattern.compile("#[^#]+#");
Matcher m = ptrn.matcher(archetype);
String fromMatcher = m.group(0);
String col = ":";
String token = fromMatcher+col;
List<String> pCol = parserHelper.getPhrases(fileContent, token);
String repl = parserHelper.getRandomElement(pCol);
String hash = "#";
String tk2 = hash + token + hash;
archetype = parserHelper.replace(archetype, tk2, repl);
flagga = m.find();
}
String theArcha = archetype;
return theArcha;
}
}
}

A good practice when posting a question here is to:
(1). Explain briefly what you expect your code to do.
(2). If you are getting errors, copy them here so that others can see what is going wrong in your code.
I honestly did not understand what you were trying to achieve, but I see a missing closing bracket in
public String getPhrase()
It should be :
public String getPhrase()
{
//logic here
}
Hope this helps

Yes, it is in the right place but you are missing the closing }, which should come directly after the {. You can't put a method inside another method like that.
Because you want to return theArcha, you should instead make it what we call "an instance variable" - you may not have heard of this? If not, look it up.
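A stripped-down sketch of that layout, in case it helps: the constructor and getPhrase() sit side by side in the class body, and the constructor stores its result in an instance variable that getPhrase() returns. To keep it self-contained, this version takes the lines directly instead of reading a file, and the class name PhraseGeneratorFromLines is made up. The interface is assumed to declare only getPhrase().

```java
import java.util.List;

// Assumed shape of the interface from the question.
interface PhraseGenerator {
    String getPhrase();
}

// Hypothetical simplified implementation: no file reading, no parsing.
class PhraseGeneratorFromLines implements PhraseGenerator {
    // Instance variable: set once in the constructor, read in getPhrase().
    private final String theArcha;

    PhraseGeneratorFromLines(List<String> lines) {
        // Placeholder for the real work (reading, parsing, replacing tokens).
        theArcha = lines.isEmpty() ? "" : lines.get(0);
    }

    @Override
    public String getPhrase() {
        return theArcha;
    }
}
```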

Your interface is probably like this
interface PhraseGenerator {
String getPhrase();
}
Then the implementing class you wrote will take the form
class PhraseGeneratorImpl implements PhraseGenerator {
private ParserHelperImpl parserHelper;
@Override // Marks a method that overrides or implements an inherited one
public String getPhrase() {
// Put all the code you want to implement here.
// If you want to use a helper class, the clean way is to use an instance of it (as you tried with parserHelper).
// If you want to use a utility method within the same class,
// say for reading something from the file system, define a private method below this one.
String filePhrase = phraseGeneratorFromFile();
// Now use filePhrase to do the other work
}
//
private String phraseGeneratorFromFile(){
// Do all the work and return the phrase/string, so declare a return type. You haven't done that in the code above.
}
}

How can one implement in Java the same as JavaScript String.replace(RegEx, function)?

I would like to do a simple String replace with a regular expression in Java, but the replacement value is not static; I would like it to be computed dynamically, as in JavaScript.
I know I can do:
"some string".replaceAll("some regex", "new value");
But I would like something like:
"some string".replaceAll("some regex", new SomeThinkIDontKnow() {
public String handle(String group) {
return "my super dynamic string group " + group;
}
});
Maybe there is a Java way to do this, but I am not aware of it...

You need to use the Java regex API directly.
Create a Pattern object for your regex (this is reusable), then call the matcher() method to run it against your string.
You can then call find() repeatedly to loop through each match in your string, and assemble a replacement string as you like.
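A minimal sketch of that loop, assuming the callback should receive the whole match (group 0). The class name CallbackReplace is invented for the example; appendReplacement, appendTail and quoteReplacement are standard Matcher methods.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that mimics JavaScript's replace(regex, function).
final class CallbackReplace {
    private CallbackReplace() {
    }

    static String replaceAll(String input, Pattern pattern, Function<String, String> callback) {
        Matcher m = pattern.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement makes the callback's result a literal,
            // so '$' and '\' in it are not read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(callback.apply(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A call could look like CallbackReplace.replaceAll("a1b22", Pattern.compile("\\d+"), g -> "<" + g + ">").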

Here is how such a replacement can be implemented.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegExCustomReplacementExample
{
public static void main(String[] args)
{
System.out.println(
new ReplaceFunction() {
public String handle(String group)
{
return "«"+group.substring(1, group.length()-1)+"»";
}
}.replace("A simple *test* string", "\\*.*?\\*"));
}
}
abstract class ReplaceFunction
{
public String replace(String source, String regex)
{
final Pattern pattern = Pattern.compile(regex);
final Matcher m = pattern.matcher(source);
boolean result = m.find();
if(result) {
StringBuilder sb = new StringBuilder(source.length());
int p=0;
do {
sb.append(source, p, m.start());
sb.append(handle(m.group()));
p=m.end();
} while (m.find());
sb.append(source, p, source.length());
return sb.toString();
}
return source;
}
public abstract String handle(String group);
}
It might look a bit complicated at first, but that doesn’t matter, as you only need to write it once. The subclasses implementing the handle method look simpler. An alternative is to pass the Matcher instead of the matched String (group 0) to the handle method, as it offers access to all groups matched by the pattern (if the pattern defines groups).
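On Java 9 or later, a similar effect is available without a helper class: Matcher.replaceAll has an overload that accepts a function of the MatchResult. The sketch below assumes Java 9+. The class and method names are made up, and it upper-cases the text between asterisks purely as an example.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; requires Java 9 or later.
final class Java9Replace {
    private static final Pattern STARRED = Pattern.compile("\\*(.*?)\\*");

    private Java9Replace() {
    }

    static String emphasize(String source) {
        // The function's result is still treated as a replacement string,
        // so quote it to keep '$' and '\' literal.
        return STARRED.matcher(source).replaceAll(
                mr -> Matcher.quoteReplacement(mr.group(1).toUpperCase(Locale.ROOT)));
    }
}
```

The MatchResult gives access to all groups, which covers the alternative mentioned above of passing more than just group 0.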

Removing accents from String

Recently I found a very helpful method in the StringUtils library:
StringUtils.stripAccents(String s)
It is really helpful for removing special characters and converting them to an ASCII "equivalent", for instance ç=c.
Now I am working for a German customer who needs this, but only for non-German characters. Umlauts should stay untouched. I realised that stripAccents won't be useful in that case.
Does anyone have experience with this?
Are there any useful tools/libraries/classes, or maybe regular expressions?
I tried to write a class that parses and replaces such characters, but it can be very difficult to build such a map for all languages...
Any suggestions appreciated...

It is best to build a custom function, like the following. If you want to avoid converting a particular character, remove it and its counterpart from the two constant strings.
private static final String UNICODE =
"ÀàÈèÌìÒòÙùÁáÉéÍíÓóÚúÝýÂâÊêÎîÔôÛûŶŷÃãÕõÑñÄäËëÏïÖöÜüŸÿÅåÇçŐőŰű";
private static final String PLAIN_ASCII =
"AaEeIiOoUuAaEeIiOoUuYyAaEeIiOoUuYyAaOoNnAaEeIiOoUuYyAaCcOoUu";
public static String toAsciiString(String str) {
if (str == null) {
return null;
}
StringBuilder sb = new StringBuilder();
for (int index = 0; index < str.length(); index++) {
char c = str.charAt(index);
int pos = UNICODE.indexOf(c);
if (pos > -1)
sb.append(PLAIN_ASCII.charAt(pos));
else {
sb.append(c);
}
}
return sb.toString();
}
public static void main(String[] args) {
System.out.println(toAsciiString("Höchstalemannisch"));
}

My gut feeling tells me the easiest way to do this would be to just list allowed characters and strip accents from everything else. This would be something like
import java.util.regex.*;
import java.text.*;
public class Replacement {
public static void main(String args[]) {
String from = "aoeåöäìé";
String result = stripAccentsFromNonGermanCharacters(from);
System.out.println("Result: " + result);
}
private static String patternContainingAllValidGermanCharacters =
"a-zA-Z0-9äÄöÖéÉüÜß";
private static Pattern nonGermanCharactersPattern =
Pattern.compile("([^" + patternContainingAllValidGermanCharacters + "])");
public static String stripAccentsFromNonGermanCharacters(
String from) {
return stripAccentsFromCharactersMatching(
from, nonGermanCharactersPattern);
}
public static String stripAccentsFromCharactersMatching(
String target, Pattern myPattern) {
StringBuffer myStringBuffer = new StringBuffer();
Matcher myMatcher = myPattern.matcher(target);
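while (myMatcher.find()) {
// quoteReplacement keeps '$' and '\' in the result from being read as group references
myMatcher.appendReplacement(myStringBuffer,
Matcher.quoteReplacement(stripAccents(myMatcher.group(1))));
}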
myMatcher.appendTail(myStringBuffer);
return myStringBuffer.toString();
}
// pretty much the same thing as StringUtils.stripAccents(String s)
// used here so I can demonstrate the code without StringUtils dependency
public static String stripAccents(String text) {
return Normalizer.normalize(text,
Normalizer.Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
}
(I realize the pattern probably doesn't contain all the characters needed, but add whatever is missing)
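A variation on the same idea without a regex, as a sketch: walk the string character by character, keep ASCII and an explicit list of German characters, and strip combining marks from everything else. The keep-list here is an assumption (ä, ö, ü, Ä, Ö, Ü, ß only, so é is stripped), and the class and method names are invented.

```java
import java.text.Normalizer;

// Hypothetical helper; adjust KEEP to whatever the customer considers German.
final class GermanAwareAccents {
    // Characters left untouched: the German umlauts and sharp s
    // (U+00E4, U+00F6, U+00FC, U+00C4, U+00D6, U+00DC, U+00DF), written as
    // escapes so the source does not depend on the file encoding.
    private static final String KEEP = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df";

    private GermanAwareAccents() {
    }

    static String stripNonGermanAccents(String input) {
        // Compose first, so that a base letter followed by a combining
        // diaeresis becomes a single precomposed character.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); i++) {
            char c = composed.charAt(i);
            if (c < 128 || KEEP.indexOf(c) >= 0) {
                out.append(c);
            } else {
                // Decompose this one character and drop its combining marks.
                String decomposed = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
                out.append(decomposed.replaceAll("\\p{M}+", ""));
            }
        }
        return out.toString();
    }
}
```

Characters that have no decomposition, such as ø, pass through unchanged, which is the same limitation the Normalizer-based stripAccents above has.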

This might give you a workaround: with it you can detect the language and extract only the text in that language.
EDIT:
You can pass in the raw string as input and set the language detection to German. It will then detect the German characters and discard the rest.
