Efficiently encoding all the special characters in a string into entities - java

I have a string like "abcd !#&$%^^&*()<>!/". The entity codes for the characters are listed in a separate string, so only the characters that appear in that second string ("!=&4....^=9...") should be encoded. I want to convert every special character, i.e. everything except alphanumerics, into its entity using a regex, because looping over the characters one by one is too slow.
For example, the result should look like "abc &#4..;&#4..". In other words, convert all the special characters found on a keyboard.
Is there an efficient regex I can write? I have tried this with loops, but looking at each character one by one while maintaining the list of special-character entities in another string is too slow.
There are libraries, but they do not convert all of the characters.
The code I wrote:
// String to be encoded
String sDecoded = "abcd !##$%^&*();'m,";
// Special-character entity list used for the replacement. Pairs are separated by the multiplication sign (×) and each pair is split by the division sign (÷), because neither can be typed on a keyboard
String specialCharacters = "&÷$amp;×–÷–";
// Check the input
if (sDecoded == null || sDecoded.trim ().length () == 0)
return (sDecoded);
// Use StringTokenizer which is faster than split method
StringTokenizer st = new StringTokenizer(specialCharacters, "×");
String[] reg = null;
String[] charactersArray = sDecoded.split("");
String sEncoded = "";
// now loop on it and in each iteration, we will be getting a decodedCharacter:EncodedEntity pair
for(int i = 0; i < charactersArray.length; i++)
{
st = new StringTokenizer(specialCharacters, "×");
while(st.hasMoreElements())
{
reg = st.nextElement().toString().split("÷");
// This is an error, the character should not be blank ever because it will be character that we will encode
if(StringUtils.isBlank(reg[0]))
return sDecoded;
String c = charactersArray[i];
if(c.equalsIgnoreCase(reg[0]))
{
sEncoded = sEncoded + c.replace(reg[0], reg[1]);
break;
}
if(st.countTokens() == 0)
sEncoded = sEncoded + c.toString();
}
}
return (sEncoded);

I don't know what definition of "efficient" you are using, but there's the "don't reinvent the wheel" efficiency of using a simple call to Apache commons-text StringEscapeUtils utility class:
String encoded = StringEscapeUtils.escapeXml11(str);
or
String encoded = StringEscapeUtils.escapeHtml4(str);
and a variety of other similar methods, depending on which exact encoding you want.
Note: This class was originally in the commons-lang3 library, but was deprecated there and moved to the commons-text library.

Your approach is quite slow and inefficient. It may look elegant nowadays to use a regex as a silver bullet for everything, but it is definitely not the right tool for this task. You are also using a tokenizer, which is slow as well, and a loop inside a loop degrades performance further.
I would recommend an iterative approach with a StringBuilder, which produces very fast results; try it for yourself. Write an 'if' branch for each special character. Even if it looks like a lot of code, it will be very fast.
Try this:
class Scratch {
public static void main(String[] args) {
System.out.println(escapeSpecials("abc &"));
}
public static String escapeSpecials(String origin) {
StringBuilder result = new StringBuilder();
char[] chars = origin.toCharArray();
for (char c : chars) {
if (c == '&') {
result.append("&amp;");
} else if (c == '\u2013') {
result.append("&ndash;");
} else {
// not a special character
result.append(c);
}
}
return result.toString();
}
}
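The chain of if branches above can also be driven by a lookup table, which stays readable as the number of special characters grows. The following is a minimal sketch, not the answerer's code: the class name EntityEncoder, the table contents, and the numeric-reference fallback for unlisted characters are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class EntityEncoder {
    // Hypothetical table: the real pairs would be loaded from the asker's own list.
    private static final Map<Integer, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put((int) '&', "&amp;");
        ENTITIES.put((int) '<', "&lt;");
        ENTITIES.put((int) '>', "&gt;");
        ENTITIES.put(0x2013, "&ndash;");
    }

    static String encode(String s) {
        if (s == null || s.isEmpty()) {
            return s;
        }
        StringBuilder out = new StringBuilder(s.length() + 16);
        // Single pass over code points, so characters outside the BMP stay intact.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            String entity = ENTITIES.get(cp);
            if (entity != null) {
                out.append(entity);                       // named entity from the table
            } else if (Character.isLetterOrDigit(cp) || Character.isWhitespace(cp)) {
                out.appendCodePoint(cp);                  // alphanumerics and whitespace pass through
            } else {
                out.append("&#").append(cp).append(';');  // assumed fallback: numeric reference
            }
        }
        return out.toString();
    }
}
```

Each character costs one hash lookup, so the work grows linearly with the input length instead of rescanning the entity list for every character.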

Related

Splitting csv lines that use "escaped" delimiter [duplicate]

I have a string vaguely like this:
foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"
that I want to split by commas -- but I need to ignore commas in quotes. How can I do this? Seems like a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I guess I meant libraries that are already part of the JDK or already part of a commonly-used libraries like Apache Commons.)
the above string should split into:
foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"
note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure
Try:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
Output:
> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"
In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
Or, a bit friendlier for the eyes:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String otherThanQuote = " [^\"] ";
String quotedString = String.format(" \" %s* \" ", otherThanQuote);
String regex = String.format("(?x) "+ // enable comments, ignore white spaces
", "+ // match a comma
"(?= "+ // start positive look ahead
" (?: "+ // start non-capturing group 1
" %s* "+ // match 'otherThanQuote' zero or more times
" %s "+ // match 'quotedString'
" )* "+ // end group 1 and repeat it zero or more times
" %s* "+ // match 'otherThanQuote'
" $ "+ // match the end of the string
") ", // stop positive look ahead
otherThanQuote, quotedString, otherThanQuote);
String[] tokens = line.split(regex, -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
which produces the same as the first example.
EDIT
As mentioned by @MikeFHay in the comments:
I prefer using Guava's Splitter, as it has saner defaults (see the discussion above about empty matches being trimmed by String#split()), so I did:
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))
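The trimming behaviour mentioned in that comment is easy to see with plain String#split(): without a limit argument, trailing empty strings are dropped, while a negative limit keeps them. A small sketch (the class and method names are made up for this illustration):

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Default split: trailing empty strings are removed from the result.
    static String[] defaultSplit(String s) {
        return s.split(",");
    }

    // Negative limit: the pattern is applied as often as possible and
    // trailing empty strings are kept.
    static String[] keepTrailing(String s) {
        return s.split(",", -1);
    }

    public static void main(String[] args) {
        String line = "foo,bar,,";
        System.out.println(Arrays.toString(defaultSplit(line))); // [foo, bar]
        System.out.println(Arrays.toString(keepTrailing(line))); // [foo, bar, , ]
    }
}
```

This is why the regex examples above pass -1 as the second argument.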
While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regards to maintainability, e.g.:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
else if (input.charAt(current) == ',' && !inQuotes) {
result.add(input.substring(start, current));
start = current + 1;
}
}
result.add(input.substring(start));
If you don't care about preserving the commas inside the quotes, you could simplify this approach (no handling of the start index, no special case for the last token) by replacing the commas inside quotes with something else and then splitting at commas:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
char currentChar = builder.charAt(currentIndex);
if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
if (currentChar == ',' && inQuotes) {
builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
}
}
List<String> result = Arrays.asList(builder.toString().split(","));
http://sourceforge.net/projects/javacsv/
https://github.com/pupi1985/JavaCSV-Reloaded
(fork of the previous library that will allow the generated output to have Windows line terminators \r\n when not running Windows)
http://opencsv.sourceforge.net/
CSV API for Java
Can you recommend a Java library for reading (and possibly writing) CSV files?
Java lib or app to convert CSV to XML file?
I would not recommend the regex answer from Bart; I find the parsing solution (as Fabian proposed) better in this particular case. I tried both the regex solution and my own parsing implementation and found that:
Parsing is much faster than splitting with the lookahead regex: ~20 times faster for short strings, ~40 times faster for long strings.
The regex fails to find the empty string after the last comma. That was not in the original question, though; it was my own requirement.
My solution and test are below.
String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;
start = System.nanoTime();
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
switch (c) {
case ',':
if (inQuotes) {
b.append(c);
} else {
tokensList.add(b.toString());
b = new StringBuilder();
}
break;
case '\"':
inQuotes = !inQuotes;
default:
b.append(c);
break;
}
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;
System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);
Of course, you are free to change the switch to else-ifs in this snippet if you feel uncomfortable with its ugliness. Note the deliberate lack of a break after the quote case: it falls through to default so that the quote itself is appended. StringBuilder was chosen over StringBuffer by design to increase speed, since thread safety is irrelevant here.
You're in that annoying boundary area where regexps almost won't do (as Bart pointed out, escaped quotes would make life hard), and yet a full-blown parser seems like overkill.
If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one
I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):
final static private Pattern splitSearchPattern = Pattern.compile("[\",]");
private List<String> splitByCommasNotInQuotes(String s) {
if (s == null)
return Collections.emptyList();
List<String> list = new ArrayList<String>();
Matcher m = splitSearchPattern.matcher(s);
int pos = 0;
boolean quoteMode = false;
while (m.find())
{
String sep = m.group();
if ("\"".equals(sep))
{
quoteMode = !quoteMode;
}
else if (!quoteMode && ",".equals(sep))
{
int toPos = m.start();
list.add(s.substring(pos, toPos));
pos = m.end();
}
}
if (pos < s.length())
list.add(s.substring(pos));
return list;
}
(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)
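One way to approach that exercise is to drop the Matcher and track an extra "escaped" flag while scanning. The following is only a sketch under assumptions: the class name is invented, a backslash is taken to escape exactly the next character, and quotes and backslashes are kept verbatim in the tokens.

```java
import java.util.ArrayList;
import java.util.List;

class EscapeAwareSplitter {
    static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        boolean escaped = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (escaped) {
                cur.append(c);          // escaped character: copy it, never interpret it
                escaped = false;
            } else if (c == '\\') {
                cur.append(c);          // keep the backslash and protect the next character
                escaped = true;
            } else if (c == '"') {
                inQuotes = !inQuotes;   // unescaped quote toggles quote mode
                cur.append(c);
            } else if (c == ',' && !inQuotes) {
                out.add(cur.toString()); // unquoted comma ends the current token
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        out.add(cur.toString());        // last token, possibly empty
        return out;
    }
}
```

With the input a,"b\",c",d this yields three tokens, because the escaped quote does not end the quoted section.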
Try a lookaround like (?<!\"),(?!\"). This should match commas that are not directly preceded or followed by a quote.
The simplest approach is not to match delimiters, i.e. commas, with a complex additional logic to match what is actually intended (the data which might be quoted strings), just to exclude false delimiters, but rather match the intended data in the first place.
The pattern consists of two alternatives, a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, we have to allow the unquoted item to be empty and to consume the next comma, if any, and use the \\G anchor:
Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");
The pattern also contains two capturing groups to get either, the quoted string’s content or the plain content.
Then, with Java 9, we can get an array as
String[] a = p.matcher(input).results()
.map(m -> m.group(m.start(1)<0? 2: 1))
.toArray(String[]::new);
whereas older Java versions need a loop like
for(Matcher m = p.matcher(input); m.find(); ) {
String token = m.group(m.start(1)<0? 2: 1);
System.out.println("found: "+token);
}
Adding the items to a List or an array is left as an exercise for the reader.
For Java 8, you can use the results() implementation of this answer, to do it like the Java 9 solution.
For mixed content with embedded strings, like in the question, you can simply use
Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");
But then, the strings are kept in their quoted form.
What about a one-liner using String.split()?
String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] split = s.split( "(?<!\".{0,255}[^\"]),|,(?![^\"].*\")" );
A regular expression is not capable of handling escaped characters. For my application, I needed the ability to escape quotes and spaces (my separator is spaces, but the code is the same).
Here is my solution in Kotlin (the language from this particular application), based on the one from Fabian Steeg:
fun parseString(input: String): List<String> {
val result = mutableListOf<String>()
var inQuotes = false
var inEscape = false
val current = StringBuilder()
for (i in input.indices) {
// If this character is escaped, add it without looking
if (inEscape) {
inEscape = false
current.append(input[i])
continue
}
when (val c = input[i]) {
'\\' -> inEscape = true // escape the next character, \ isn't added to result
',' -> if (inQuotes) {
current.append(c)
} else {
result += current.toString()
current.clear()
}
'"' -> inQuotes = !inQuotes
else -> current.append(c)
}
}
if (current.isNotEmpty()) {
result += current.toString()
}
return result
}
I think this is not a place to use regular expressions. Contrary to other opinions, I don't think a parser is overkill. It's about 20 lines and fairly easy to test.
Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and map that grouping to a map of string,string.
After you split on comma, replace all mapped identifiers with the original string values.
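The placeholder idea described above could look roughly like this. It is a sketch, not the answerer's code: the class name and the exact placeholder format are assumptions, and it presumes the input never contains text that looks like a placeholder and has no escaped quotes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSplitter {
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    static List<String> split(String input) {
        // Step 1: replace every quoted group with a unique placeholder and remember it.
        Map<String, String> saved = new LinkedHashMap<>();
        Matcher m = QUOTED.matcher(input);
        StringBuffer masked = new StringBuffer();
        int n = 0;
        while (m.find()) {
            String key = "__IDENTIFIER_" + (n++) + "__"; // assumed placeholder format
            saved.put(key, m.group());
            m.appendReplacement(masked, key);
        }
        m.appendTail(masked);

        // Step 2: a plain split is now safe, since no quoted commas remain.
        List<String> result = new ArrayList<>();
        for (String token : masked.toString().split(",", -1)) {
            // Step 3: put the original quoted text back.
            for (Map.Entry<String, String> e : saved.entrySet()) {
                token = token.replace(e.getKey(), e.getValue());
            }
            result.add(token);
        }
        return result;
    }
}
```

The trailing "__" in each placeholder keeps __IDENTIFIER_1__ from matching inside __IDENTIFIER_10__ when the values are restored.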
I would do something like this:
boolean foundQuote = false;
if (currentString.charAt(currentStringIndex) == '"')
{
foundQuote = true;
}
if (foundQuote)
{
// do nothing
}
else
{
String[] split = currentString.split(",");
}

How to convert hyphen-delimited tag names to camel case

I have a String like this:
<phone-residence></phone-residence><marital-status>1</marital-status><phone-on-request></phone-on-request>
I want to remove the hyphens (-) and uppercase the single alphabetic character following each removed hyphen, i.e. convert from hyphen-delimited words to "CamelCase".
Like this:
<phoneResidence></phoneResidence><maritalStatus>1</maritalStatus><phoneOnRequest></phoneOnRequest>
How to do this?
Since Java 9, Matcher#replaceAll() has an overload that takes a transformation function, which modifies the matched subsequences "on the fly" and builds the final output.
First, a warning: regexes are fantastic, incredibly powerful tools for a certain class of problem. Before applying a regex you must determine whether the problem is amenable to one. Ordinarily, processing XML is the antithesis of a regex-amenable problem, except in this case, where the goal is to treat the input as merely a string and not as XML. (However, read the caveat below carefully.)
Here is a famous quote from Jamie Zawinski in 1997:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Solution
With those caveats, here's the code for your question:
String input="<phone-residence></phone-residence><marital-status>1</marital-status><phone-on-request></phone-on-request>";
Matcher m = Pattern.compile("-[a-zA-Z]").matcher(input);
// Do all the replacements in one statement using the functional replaceAll()
String result = m.replaceAll(s -> s.group().substring(1).toUpperCase());
Explanation
The regex matches a single hyphen followed by any single alphabetic character, upper or lowercase. The replaceAll() method scans the input using the Matcher. At every match it invokes the lambda (functional shorthand for an anonymous class with a single apply() method), passing in a MatchResult that describes the matched text. Whatever the lambda returns is then substituted into the output string being built by replaceAll(), in place of the matched text.
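On Java versions before 9 the function-taking replaceAll() overload is not available, but the same scan-and-substitute loop can be written by hand with appendReplacement() and appendTail(). A minimal sketch (the class and method names are invented here, and the letter is captured in a group instead of using substring(1)):

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CamelCaseTags {
    // A hyphen followed by one ASCII letter; the letter is captured as group 1.
    private static final Pattern HYPHEN_LETTER = Pattern.compile("-([a-zA-Z])");

    static String toCamelCase(String input) {
        Matcher m = HYPHEN_LETTER.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Copies the text since the previous match, then the upper-cased letter.
            // Locale.ROOT avoids locale-specific case mappings.
            m.appendReplacement(sb, m.group(1).toUpperCase(Locale.ROOT));
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }
}
```

Like the one-statement version, this treats the input purely as text, so the caveat below applies to it as well.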
Caveat
The solution given above is completely blind to the structure of the XML: it will change any -a combination (where a stands for any letter) and replace it with just A (the corresponding upper-case letter), regardless of where it appears.
In the example you gave, this pattern occurs only in the tag names. If, however, other parts of the file contain (or can contain) that pattern, those instances will also be replaced. This could be a problem if the pattern occurs in text data (i.e. content between the tags rather than inside them) or in an attribute value. Applying a regex blindly to the entire file is the chainsaw approach. If you really, really need a chainsaw, you use it.
However, if it turns out a chainsaw is too powerful and your actual task requires more finesse, then you would need to switch to a real XML parser (the JDK includes a good one), which can handle all the subtleties. It delivers to you the various syntactic bits and pieces such as tag name, attribute names, attribute values, text, etc. separately, so that you can explicitly decide which parts are to be affected. You'd still use the replaceAll() above but apply it only to the parts where it was needed.
Almost as a rule, you will ABSOLUTELY NOT use regexes to process XML, or parse strings containing nested or escaped quotes, or parse CSV or TSV files. Those data formats are not normally suitable domains for using regexes.
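As an illustration of the parser route, the JDK's DOM API can rename the elements and leave text content alone. This is a sketch under assumptions: the class and method names are hypothetical, the input must be a well-formed document with a single root element (the string in the question would need a wrapper), and attributes and namespaces are not handled.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TagRenamer {
    private static final Pattern HYPHEN_LETTER = Pattern.compile("-([a-zA-Z])");

    // The same hyphen-to-camel-case rewrite, applied to a single name.
    static String camel(String name) {
        Matcher m = HYPHEN_LETTER.matcher(name);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, m.group(1).toUpperCase(Locale.ROOT));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    static String renameTags(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Snapshot the live NodeList before renaming anything.
            NodeList live = doc.getElementsByTagName("*");
            List<Node> elements = new ArrayList<>();
            for (int i = 0; i < live.getLength(); i++) {
                elements.add(live.item(i));
            }
            for (Node e : elements) {
                doc.renameNode(e, null, camel(e.getNodeName())); // element names only
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException("could not rewrite tag names", ex);
        }
    }
}
```

Because only element names pass through camel(), a text value such as abc-abc keeps its hyphen.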
This is very simple, actually. Just read each character of the input string and use a boolean to decide whether the character should be added as-is, capitalized, or ignored:
public class Main {
public static void main(String[] args) {
String input = "<phone-residence></phone-residence><marital-status>1</marital-status><phone-on-request></phone-on-request>";
StringBuilder output = new StringBuilder();
boolean capitalizeNext = false;
for (int i = 0; i < input.length(); i++) {
char thisChar = input.charAt(i);
if (thisChar == '-') {
capitalizeNext = true;
} else if (capitalizeNext) {
output.append(String.valueOf(thisChar).toUpperCase());
capitalizeNext = false;
} else {
output.append(thisChar);
capitalizeNext = false;
}
}
System.out.println(output.toString());
}
}
Output:
<phoneResidence></phoneResidence><maritalStatus>1</maritalStatus><phoneOnRequest></phoneOnRequest>
Same Code w/ Additional Comments:
public class Main {
public static void main(String[] args) {
String input = "<phone-residence></phone-residence><marital-status>1</marital-status><phone-on-request></phone-on-request>";
StringBuilder output = new StringBuilder();
// This is used to determine if the next character should be capitalized
boolean capitalizeNext = false;
// Loop through each character of the input String
for (int i = 0; i < input.length(); i++) {
// Obtain the current character from the String
char thisChar = input.charAt(i);
if (thisChar == '-') {
// If this character is a hyphen, set the capitalizeNext flag, but do NOT add this character to
// the output string (ignore it)
capitalizeNext = true;
} else if (capitalizeNext) {
// The last character was a hyphen, so capitalize this character and add it to the output string
output.append(String.valueOf(thisChar).toUpperCase());
// Reset the boolean so we make a new determination on the next pass
capitalizeNext = false;
} else {
// Just a regular character; add it to the output string as-is
output.append(thisChar);
// Reset the boolean so we make a new determination on the next pass
capitalizeNext = false;
}
}
// Just print the final output
System.out.println(output.toString());
}
}
If you are sure that the element values in your XML file do not contain any hyphens, or if it does not matter whether they are affected by the change, then you can use the following code:
Code:
String input="<phone-residence></phone-residence><marital-status>1</marital-status><phone-on-request></phone-on-request>";
//this regex will match all letters preceded by a hyphen
Matcher m = Pattern.compile("-[a-zA-Z]").matcher(input);
//use a string builder to manipulate the intermediate strings that are constructed
StringBuilder sb = new StringBuilder();
int last = 0;
//for each match
while (m.find()) {
//append the substring between the last match (or the beginning of the string) and the beginning of the current match
sb.append(input.substring(last, m.start()));
//change the case to uppercase of the match
sb.append(m.group(0).toUpperCase());
//set last to the end of the current match
last = m.end();
}
//add the rest of the input string
sb.append(input.substring(last));
//remove all the hyphens and print the string
System.out.println(sb.toString().replaceAll("-", ""));
Output:
<phoneResidence></phoneResidence><maritalStatus>1</maritalStatus><phoneOnRequest></phoneOnRequest>
Improvement:
If you have hyphens in the element values of your XML and you do NOT want them to be affected by this change, you can use the following code. This simplified version works only if your elements have no attributes (you can add the logic for attributes) and only for small XML trees (for bigger XML documents you might have to increase the stack size to avoid stack overflow errors):
Code:
String input="<contact-root><phone-residence>abc-abc</phone-residence><marital-status>1</marital-status><phone-on-request><empty-node></empty-node></phone-on-request><empty-node/><not-really-empty-node>phone-on-request</not-really-empty-node></contact-root>";
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(input)));
StringBuilder strBuild = new StringBuilder();
xmlTrasversal(doc.getDocumentElement(),-1, strBuild);
System.out.println(input);
System.out.println();
System.out.println(strBuild.toString());
Functions used:
public static String capitalizeNext(String input){
Matcher m = Pattern.compile("-[a-zA-Z]").matcher(input);
StringBuilder sb = new StringBuilder();
int last = 0;
while (m.find()) {
sb.append(input.substring(last, m.start()));
sb.append(m.group(0).toUpperCase());
last = m.end();
}
sb.append(input.substring(last));
return (sb.toString().replaceAll("-", ""));
}
public static void xmlTrasversal(Element e, int depth, StringBuilder strBuild)
{
++depth;
String spaces=" ";
spaces=String.join("", Collections.nCopies(depth, spaces));
if(!e.hasChildNodes())
strBuild.append(spaces+"<"+capitalizeNext(e.getNodeName())+"/>"+System.getProperty("line.separator"));
else if(e.getChildNodes().getLength()==1 && !(e.getChildNodes().item(0) instanceof Element))
{
strBuild.append(spaces+"<"+capitalizeNext(e.getNodeName())+">");
strBuild.append(e.getTextContent());
}
else
{
strBuild.append(spaces+"<"+capitalizeNext(e.getNodeName())+">"+System.getProperty("line.separator"));
}
for (int i=0; i<e.getChildNodes().getLength();i++)
{
if (e.getChildNodes().item(i) instanceof Element) {
xmlTrasversal((Element) e.getChildNodes().item(i), depth, strBuild);
}
}
if(e.getChildNodes().getLength()==1 && !(e.getChildNodes().item(0) instanceof Element))
strBuild.append("</"+capitalizeNext(e.getNodeName())+">"+System.getProperty("line.separator"));
else if(e.hasChildNodes() && (e.getChildNodes().item(0) instanceof Element))
strBuild.append(spaces+"</"+capitalizeNext(e.getNodeName())+">"+System.getProperty("line.separator"));
}
Output for input string:
<contactRoot>
<phoneResidence>abc-abc</phoneResidence>
<maritalStatus>1</maritalStatus>
<phoneOnRequest>
<emptyNode/>
</phoneOnRequest>
<emptyNode/>
<notReallyEmptyNode>phone-on-request</notReallyEmptyNode>
</contactRoot>
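If you only need to leave the element values alone and do not need the re-indented output, a regex-only variant is also possible. The following is a minimal sketch, not a replacement for a real XML parser: the class name TagNameCamelCaser and its method are made up for illustration, and it assumes that a literal "<" only ever starts a tag (so no comments or CDATA sections containing tag-like text). Unlike the code above, it keeps a hyphen that is not followed by a letter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: camel-cases tag names only, leaving text content untouched.
final class TagNameCamelCaser {
    // "<" or "</" followed by a tag name (group 2).
    private static final Pattern TAG_START =
            Pattern.compile("(</?)([A-Za-z_][A-Za-z0-9_.-]*)");
    // A hyphen followed by a letter (group 1).
    private static final Pattern HYPHEN_LETTER = Pattern.compile("-([a-zA-Z])");

    static String camelCaseTagNames(String xml) {
        Matcher m = TAG_START.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String renamed = m.group(1) + camelCase(m.group(2));
            // quoteReplacement guards against '$' or '\' in the replacement text
            m.appendReplacement(out, Matcher.quoteReplacement(renamed));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Drops each hyphen that precedes a letter and upper-cases that letter.
    private static String camelCase(String name) {
        Matcher m = HYPHEN_LETTER.matcher(name);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, m.group(1).toUpperCase());
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Because the text between tags never follows a "<", a value such as abc-abc is not matched and stays as it is.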
Try This:
String str = "<phone-residence></phone-residence><marital-status>1</marital-status><phone-on-request></phone-on-request>";
StringBuilder sb = new StringBuilder();
StringTokenizer stk = new StringTokenizer(str,"-");
while(stk.hasMoreTokens()){
sb.append(WordUtils.capitalize(stk.nextToken()));
}
System.out.println(sb.toString());

Efficient replacement of all unsupported chars in a String [duplicate]

Possible Duplicate:
Converting Symbols, Accent Letters to English Alphabet
I need to replace all accented characters, such as
"à", "é", "ì", "ò", "ù"
with
"a'", "e'", "i'", "o'", "u'"...
because of an issue with reloading nested strings with accented characters after they've been saved.
Is there a way to do this without using different string replacement for all chars?
For example, I would prefer to avoid doing
text = text.replace("à", "a'");
text2 = text.replace("è", "e'");
text3 = text2.replace("ì", "i'");
text4 = text3.replace("ò", "o'");
text5 = text4.replace("ù", "u'");
etc.
I tried the following approach from another post, and it seems to work:
// str is the input string
str = Normalizer.normalize(str, Normalizer.Form.NFD);
str = str.replaceAll("\\p{InCombiningDiacriticalMarks}+", "'");
Edit:
However, replacing the combining diacritical marks has a side effect: you can no longer distinguish between À, Á and Â.
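To make that side effect concrete, here is a small self-contained sketch of the same Normalizer approach. The class and method names are only for illustration, and the accented characters are written as Unicode escapes so that the source compiles regardless of file encoding.

```java
import java.text.Normalizer;

// Illustrative wrapper around the Normalizer-based replacement.
final class AccentToApostrophe {
    static String convert(String s) {
        // NFD splits an accented letter into a base letter plus combining mark(s)
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Every run of combining marks becomes a single apostrophe
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "'");
    }

    public static void main(String[] args) {
        // a-grave, e-acute, i-grave, o-grave, u-grave
        System.out.println(convert("\u00e0\u00e9\u00ec\u00f2\u00f9"));
        // A-grave, A-acute and A-circumflex all collapse to the same result
        System.out.println(convert("\u00c0 \u00c1 \u00c2"));
    }
}
```

Since the kind of accent is thrown away, the conversion cannot be reversed. If you need to restore the original characters later, an explicit mapping is the safer choice.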
If you don't mind adding commons-lang as a dependency, try StringUtils.replaceEach.
I believe the following performs the same task:
import org.apache.commons.lang.StringUtils;
public class ReplaceEachTest
{
public static void main(String [] args)
{
String text = "àéìòùàéìòù";
String [] searchList = {"à", "é", "ì", "ò", "ù"};
String [] replaceList = {"a'", "e'", "i'", "o'", "u'"};
String newtext = StringUtils.replaceEach(text, searchList, replaceList);
System.out.println(newtext);
}
}
This example prints a'e'i'o'u'a'e'i'o'u'
However, in general I agree that since you are creating a custom character translation, you will need a solution where you explicitly specify the replacement for each character of interest.
My previous answer using replaceChars is no good because it only handles one-to-one character replacement.
After reading the comments on the main post, I think a better option would be to fix the underlying problem (which is encoding related?) rather than try to cover up the symptoms.
Also, this still requires a manual explicit mapping, which might make it less ideal than nandeesh's answer with a regular expression unicode character class.
Here is a skeleton for code to perform the mapping. It is slightly more complicated than a char-char.
This code tries to avoid creating extra Strings. It may or may not be "more efficient". Try it with the real data/usage. YMMV.
String mapAccentChar (char ch) {
switch (ch) {
case 'à': return "a'";
// etc
}
return null;
}
String mapAccents (String input) {
StringBuilder sb = new StringBuilder();
int l = input.length();
for (int i = 0; i < l; i++) {
char ch = input.charAt(i);
String mapped = mapAccentChar(ch);
if (mapped != null) {
sb.append(mapped);
} else {
sb.append(ch);
}
} // closes the for loop
return sb.toString();
}
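If the list of characters grows, the switch can be swapped for a lookup table. This is only a sketch of that variation: the class name AccentMapper and the five entries are assumptions taken from the examples in the question, and the characters are written as Unicode escapes to avoid source-encoding problems.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative table-driven variant of the explicit mapping.
final class AccentMapper {
    private static final Map<Character, String> MAPPING = new HashMap<>();
    static {
        MAPPING.put('\u00e0', "a'"); // a with grave
        MAPPING.put('\u00e9', "e'"); // e with acute
        MAPPING.put('\u00ec', "i'"); // i with grave
        MAPPING.put('\u00f2', "o'"); // o with grave
        MAPPING.put('\u00f9', "u'"); // u with grave
    }

    static String mapAccents(String input) {
        StringBuilder sb = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            String mapped = MAPPING.get(ch);
            if (mapped != null) {
                sb.append(mapped); // character has an explicit replacement
            } else {
                sb.append(ch);     // everything else is copied unchanged
            }
        }
        return sb.toString();
    }
}
```

The string is still scanned once, and adding a character means adding one table entry. Whether it beats the switch is something to measure on your own data.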
Since there is no strict correlation between ASCII value of a char and its accented version, your replacement seems to me the most straightforward way.

How to remove high-ASCII characters from string like ®, ©, ™ in Java

I want to detect and remove high-ASCII characters like ®, ©, ™ from a String in Java. Is there any open-source library that can do this?
If you need to remove all non-US-ASCII (i.e. outside 0x0-0x7F) characters, you can do something like this:
s = s.replaceAll("[^\\x00-\\x7f]", "");
If you need to filter many strings, it would be better to use a precompiled pattern:
private static final Pattern nonASCII = Pattern.compile("[^\\x00-\\x7f]");
...
s = nonASCII.matcher(s).replaceAll("");
And if it's really performance-critical, perhaps Alex Nikolaenkov's suggestion would be better.
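Put together, the precompiled-pattern version might look like the following sketch. The class and method names are made up for the example; the regex is the one shown above.

```java
import java.util.regex.Pattern;

// Illustrative helper that reuses one compiled pattern for every call.
final class AsciiFilter {
    // Anything outside the US-ASCII range 0x00-0x7F
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7f]");

    static String stripNonAscii(String s) {
        return NON_ASCII.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // registered sign, copyright sign and trade mark sign are removed
        System.out.println(stripNonAscii("Java\u00ae \u00a9 2012\u2122"));
    }
}
```

String.replaceAll compiles the regular expression on every call, so keeping the Pattern in a static final field only pays off when many strings are filtered.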
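I think you can easily filter the string by hand by checking the code of each character. If it fits your requirements, append it to a StringBuilder and call toString() on it at the end. Note that the version below keeps only printable ASCII (0x20 to 0x7e), so it also drops control characters such as tabs and newlines.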
public static String filter(String str) {
StringBuilder filtered = new StringBuilder(str.length());
for (int i = 0; i < str.length(); i++) {
char current = str.charAt(i);
if (current >= 0x20 && current <= 0x7e) {
filtered.append(current);
}
}
return filtered.toString();
}
A nice way to do this is to use Google Guava CharMatcher:
String newString = CharMatcher.ASCII.retainFrom(string);
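newString will contain only the ASCII characters (code point < 128) from the original string. In newer Guava releases the same matcher is obtained through the CharMatcher.ascii() factory method instead of the ASCII constant.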
This reads more naturally than a regular expression. Regular expressions can take more effort to understand for subsequent readers of your code.
I understand that you need to delete characters such as ç, ã, Ã, but for anybody who needs to convert ç, ã, Ã ---> c, a, A, please have a look at this piece of code:
Example Code:
final String input = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";
System.out.println(
Normalizer
.normalize(input, Normalizer.Form.NFD)
.replaceAll("[^\\p{ASCII}]", "")
);
Output:
This is a funky String
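One caveat with this approach: it only works for characters that have a canonical decomposition into a base letter plus combining marks. Letters such as ß or ø have none, so they are simply removed. A small sketch, with a made-up class name and Unicode escapes in place of the literal characters:

```java
import java.text.Normalizer;

// Illustrative wrapper around the normalize-then-strip approach.
final class AsciiFolder {
    static String toAscii(String s) {
        return Normalizer
                .normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("caf\u00e9"));    // e-acute decomposes, the base letter survives
        System.out.println(toAscii("stra\u00dfe"));  // sharp s has no decomposition and is dropped
    }
}
```

If such letters matter in your data, they need an explicit replacement before the non-ASCII characters are stripped.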
