Regex to remove spaces from file name - java

I have some HTML strings which contain images. I need to remove spaces from the image names because some tablets do not accept them. (I have already renamed all the image resources.) I think the only part that needs fixing is ...
src="file:///android_asset/images/ ?? ?? .???"
because those links are valid links.
I spent half a day on it and am still struggling with a performance issue. The following code works, but it is really slow...
public static void main(String[] args) {
String str = "<IMG height=286 alt=\"eye_anatomy 1.jpg\" src=\"file:///android_asset/images/eye_anatomy 1 .jpg\" width=350 border=0></P> fd ssda f \r\n"
+ "fd <P align=center><IMG height=286 alt=\"eye_anatomy 1.jpg\" src=\"file:///android_asset/images/ eye_anato my 1 .bmp\" width=350 border=0></P>\r\n"
+ "\r\n<IMG height=286 alt=\"eye_anatomy 1.jpg\" src=\"file:///android_asset/images/eye_anatomy1.png\" width=350 border=0>\r\n";
Pattern p = Pattern.compile("(.*?)(src=\"file:///android_asset/images/)(.*?\\s+.*?)(\")", Pattern.DOTALL);
Matcher m = p.matcher(str);
StringBuilder sb = new StringBuilder("");
int i = 0;
while (m.find()) {
sb.append(m.group(1)).append(m.group(2)).append(m.group(3).replaceAll("\\s+", "")).append(m.group(4));
i = m.end();
}
sb.append(str.substring(i, str.length()));
System.out.println(sb.toString());
}
So the real question is: how can I remove spaces from the image names efficiently using regex?
Thank you.

Regex is as regex does. :-) Seriously, regex is great for really particular cases, but for stuff like this I find myself writing lower-level code. So the following isn't a regex; it's a function. But it does what you want and does it much faster than your regex. (That said, if someone does come up with a regex that fits the bill and performs well, I'd love to see it.)
The following function segments the source string using spaces as delimiters, then recognizes and cleans up your alt and src attributes by not appending spaces while assembling the result. I did the alt attribute only because you were putting file names there too. One side effect is that this will collapse multiple spaces into one space in the rest of the markup, but browsers do that anyway. You can optimize the code a bit by re-using a StringBuilder. It presumes double-quotes around attributes.
I hope this helps.
private String removeAttrSpaces(final String str) {
final StringBuilder sb = new StringBuilder(str.length());
boolean inAttribute = false;
for (final String segment : str.split(" ")) {
if (segment.startsWith("alt=\"") || segment.startsWith("src=\"")) {
inAttribute = true;
}
if (inAttribute && segment.endsWith("\"")) {
inAttribute = false;
}
sb.append(segment);
if (!inAttribute) {
sb.append(' ');
}
}
return sb.toString();
}

Here's a function that should be faster http://ideone.com/vlspF:
private static String removeSpacesFromImages(String aText){
Pattern p = Pattern.compile("(?<=src=\"file:///android_asset/images/)[^\"]*");
StringBuffer result = new StringBuffer();
Matcher matcher = p.matcher(aText);
while ( matcher.find() ) {
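// quoteReplacement keeps a '$' or '\' in the file name from being read as a group reference.
matcher.appendReplacement(result, Matcher.quoteReplacement(matcher.group(0).replaceAll("\\s+","")));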
}
matcher.appendTail(result);
return result.toString();
}
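As a rough sketch of the same idea with the pattern compiled only once and the result assembled by hand, something like the following might help when the method is called for many strings. The class name ImageSrcCleaner and the constant IMAGE_SRC are made up for this illustration, and Character.isWhitespace is used in place of the \s+ regex, which is close but not identical in what it treats as whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answers.
final class ImageSrcCleaner {
    // Compiled once and reused for every call.
    private static final Pattern IMAGE_SRC =
            Pattern.compile("(?<=src=\"file:///android_asset/images/)[^\"]*");

    static String removeSpacesFromImages(String html) {
        Matcher m = IMAGE_SRC.matcher(html);
        StringBuilder out = new StringBuilder(html.length());
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(html, last, m.start());
            // Copy the file name, skipping whitespace characters.
            String name = m.group();
            for (int i = 0; i < name.length(); i++) {
                char c = name.charAt(i);
                if (!Character.isWhitespace(c)) {
                    out.append(c);
                }
            }
            last = m.end();
        }
        // Copy whatever follows the last match.
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<IMG alt=\"a b\" src=\"file:///android_asset/images/ eye_anato my 1 .bmp\" width=350>";
        System.out.println(removeSpacesFromImages(html));
    }
}
```

Because the text is appended directly instead of going through appendReplacement, characters such as $ in a file name need no escaping here.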


Underscore to camel case except for certain prefixes

I am currently creating a Java program to rewrite some outdated Java classes in our software. Part of the conversion includes changing variable names from containing underscores to using camelCase instead. The problem is, I cannot simply replace all underscores in the code. We have some classes with constants and for those, the underscore should remain.
How can I replace instances like string_label with stringLabel, but DO NOT replace underscores that occur after the prefix "Parameters."?
I am currently using the following which obviously does not handle excluding certain prefixes:
public String stripUnderscores(String line) {
Pattern p = Pattern.compile("_(.)");
Matcher m = p.matcher(line);
StringBuffer sb = new StringBuffer();
while(m.find()) {
m.appendReplacement(sb, m.group(1).toUpperCase());
}
m.appendTail(sb);
return sb.toString();
}
You could possibly try something like:
Pattern.compile("(?<!(class\\s+Parameters.+|Parameters\\.[\\w_]+))_(.)")
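which uses a negative lookbehind. Be aware that Java expects a lookbehind to have a bounded maximum length, so the unbounded + quantifiers here can make Pattern.compile throw a PatternSyntaxException (at least on older Java versions); bounded quantifiers such as {1,100} avoid that.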
You would probably be better served using some kind of refactoring tool that understood scoping semantics.
If all you check for is a qualified name like Parameters.is_module_installed then you will replace
class Parameters {
static boolean is_module_installed;
}
by mistake. And there are more corner cases like this. (import static Parameters.*;, etc., etc.)
Using regular expressions alone seems troublesome to me. One way you can make the routine smarter is to use regex just to capture an expression of identifiers and then you can examine it separately:
static List<String> exclude = Arrays.asList("Parameters");
static String getReplacement(String in) {
for(String ex : exclude) {
if(in.startsWith(ex + "."))
return in;
}
StringBuffer b = new StringBuffer();
Matcher m = Pattern.compile("_(.)").matcher(in);
while(m.find()) {
m.appendReplacement(b, m.group(1).toUpperCase());
}
m.appendTail(b);
return b.toString();
}
static String stripUnderscores(String line) {
Pattern p = Pattern.compile("([_$\\w][_$\\w\\d]+\\.?)+");
Matcher m = p.matcher(line);
StringBuffer sb = new StringBuffer();
while(m.find()) {
// The pattern allows '$' in identifiers, so quote the replacement text.
m.appendReplacement(sb, Matcher.quoteReplacement(getReplacement(m.group())));
}
m.appendTail(sb);
return sb.toString();
}
But that will still fail for e.g. class Parameters { is_module_installed; }.
It could be made more robust by further breaking down each expression:
static String getReplacement(String in) {
if(in.contains(".")) {
StringBuilder result = new StringBuilder();
String[] parts = in.split("\\.");
for(int i = 0; i < parts.length; ++i) {
if(i > 0) {
result.append(".");
}
String part = parts[i];
if(i == 0 || !exclude.contains(parts[i - 1])) {
part = getReplacement(part);
}
result.append(part);
}
return result.toString();
}
StringBuffer b = new StringBuffer();
Matcher m = Pattern.compile("_(.)").matcher(in);
while(m.find()) {
m.appendReplacement(b, m.group(1).toUpperCase());
}
m.appendTail(b);
return b.toString();
}
That would handle a situation like
Parameters.a_b.Parameters.a_b.c_d
and output
Parameters.a_b.Parameters.a_b.cD
That's impossible Java syntax but I hope you see what I mean. Doing a little parsing yourself goes a long way.
Maybe you can have another Pattern:
Pattern p = Pattern.compile("^Parameters.*"); // ^ anchors at the start of the input (and at each line start only with Pattern.MULTILINE)
If this matches, don't replace anything.
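Another way to get a similar effect, sketched here as an alternative to the answers above rather than a reconstruction of them, is to let one alternation match either a token that must be left alone or an underscore to convert, and decide per match. The class name CamelCaser and the TOKEN pattern are invented for this example; like the regex answers above, it knows nothing about scoping, so it only protects the member that directly follows the Parameters. prefix.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class CamelCaser {
    // First alternative: a name qualified by "Parameters." (kept as-is).
    // Second alternative: an underscore followed by a letter or digit (converted).
    private static final Pattern TOKEN =
            Pattern.compile("\\bParameters\\.\\w+|_([a-zA-Z0-9])");

    static String stripUnderscores(String line) {
        Matcher m = TOKEN.matcher(line);
        StringBuilder out = new StringBuilder(line.length());
        int last = 0;
        while (m.find()) {
            out.append(line, last, m.start());
            if (m.group(1) == null) {
                // The "Parameters.xxx" alternative matched: copy it unchanged.
                out.append(m.group());
            } else {
                // Drop the underscore and upper-case the following character.
                out.append(m.group(1).toUpperCase(Locale.ROOT));
            }
            last = m.end();
        }
        out.append(line, last, line.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripUnderscores(
                "boolean string_label = Parameters.is_module_installed;"));
    }
}
```

Because the regex engine scans left to right, a protected token is consumed whole before any underscore inside it can be seen by the second alternative.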

How to add the html tags but still keep the spaces intact?

I am working on interview problems from http://www.glassdoor.com/Interview/Indeed-Software-Engineer-Intern-Interview-Questions-EI_IE100561.0,6_KO7,31.htm
The current problem I am doing is "The second question is searching a particular word in a string, and add "<b>" "<\b>" around the word's every appearance."
Here's my code:
public class AddBsAround {
public static void main(String[] args) {
String testCase = "Don't you love it when you install all software and all programs";
System.out.println(addBs(testCase, "all"));
}
public static String addBs(String sentence, String word) {
String result = "";
String[] words = sentence.trim().split("\\s+");
for(String wordInSentence: words) {
if(wordInSentence.equals(word)) {
result += "<b>" +word + "</b> ";
} else {
result += wordInSentence + " ";
}
}
return result;
}
}
The code produces essentially the correct output; that is, when passed in the testcase, it produces
Don't you love it when you install <b>all</b> software and <b>all</b> programs
, avoiding the bug that the original author had: when searching for "all" in "install", his code would produce "inst<b>all</b>".
However, would the spaces be an issue? When I pass in a sentence with several consecutive spaces between words, or with trailing spaces, such as
"Don't you love it "
, my code produces "Don't you love it", that is, the sentence with just one space between the words. Do you see this as an issue? I kind of do, because the client might not expect this method to alter the spacing. Is there a workaround? I feel like I need to use a regex to separate the words.
Rather than splitting on \\s+, split on \\s -- that way, it splits on every single space instead of every group of them, and when you put them back together, the amount of spaces is preserved. The difference is that + tells the regex to split on one or more spaces, but without it, it's exactly a single one.
Aside from that, I'd recommend also using a StringBuilder to join the strings, since it's more efficient for very long ones, and you want to be the best possible, right?
It's just the one character change, but for the sake of completeness, this is your new method:
public static String addBs(String sentence, String word) {
StringBuilder result = new StringBuilder();
String[] words = sentence.trim().split("\\s");
for(String wordInSentence: words) {
if(wordInSentence.equals(word)) {
result.append("<b>").append(word).append("</b> ");
} else {
result.append(wordInSentence).append(" ");
}
}
return result.toString();
}
The result, using this code, is this:
Don't you love it when you install <b>all</b> software and <b>all</b> programs
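To see why the split pattern matters, here is a small sketch comparing the two patterns. The class name SplitDemo and the roundTrip helper are made up for the example, and it assumes the whitespace consists of plain spaces (a tab would come back as a space after joining). It also passes a limit of -1 to split so that trailing empty strings, and therefore trailing spaces, are not dropped.

```java
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
final class SplitDemo {
    // Splits on the given regex and joins the pieces back with single spaces.
    static String roundTrip(String sentence, String regex) {
        return String.join(" ", sentence.split(regex, -1));
    }

    public static void main(String[] args) {
        String s = "a  b "; // two spaces in the middle, one at the end
        // "\\s+" treats each run of spaces as one delimiter, so the run length is lost.
        System.out.println(Arrays.toString(s.split("\\s+", -1)));
        // "\\s" splits at every space, leaving empty strings that mark the extra spaces.
        System.out.println(Arrays.toString(s.split("\\s", -1)));
        System.out.println("[" + roundTrip(s, "\\s+") + "]");
        System.out.println("[" + roundTrip(s, "\\s") + "]");
    }
}
```

Only the single-character split reproduces the original spacing when the pieces are joined again.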
You can use lookarounds in regex:
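public static String addBs(String sentence, String word) {
StringBuilder result = new StringBuilder();
// Split just before each run of whitespace, so every piece keeps its leading spaces.
String[] words = sentence.split("(?<!\\s)(?=\\s)");
for(String wordInSentence: words) {
String trimmed = wordInSentence.trim();
if(trimmed.equals(word)) {
// Re-emit the piece's own leading whitespace, then the wrapped word.
String leading = wordInSentence.substring(0, wordInSentence.length() - trimmed.length());
result.append(leading).append("<b>").append(word).append("</b>");
} else {
// The piece already carries its spaces; appending another one would double them.
result.append(wordInSentence);
}
}
return result.toString();
}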
Output:
Don't you love it when you install <b>all</b> software and <b>all</b> programs
(?<!\\s) is a negative lookbehind, which means the preceding character is not whitespace, and (?=\\s) is a positive lookahead, which means the following character is whitespace.
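A quick way to check what this lookaround split produces is to print the pieces. This is only a sketch; the class name LookaroundSplitDemo and the pieces helper are invented for it.

```java
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
final class LookaroundSplitDemo {
    // Splits at every position that is preceded by a non-whitespace character
    // (or the start of the input) and followed by whitespace.
    static String[] pieces(String sentence) {
        return sentence.split("(?<!\\s)(?=\\s)");
    }

    public static void main(String[] args) {
        String s = "a  b c"; // two spaces after "a", one after "b"
        String[] parts = pieces(s);
        System.out.println(Arrays.toString(parts));
        // The split is zero-width, so nothing is consumed and joining restores the input.
        System.out.println(String.join("", parts).equals(s));
    }
}
```

Each piece after the first starts with the whitespace run that preceded its word, which is why the spacing can be preserved exactly.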
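As other people suggested, splitting on a single space would be better. Just to approach it in a different way, try Java's Pattern class.
public static String addBs(String sentence, String word) {
// \b keeps "all" from matching inside "install"; Pattern.quote treats the word literally.
Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
Matcher m = pattern.matcher(sentence);
// quoteReplacement guards against '$' or '\' in the word.
return m.replaceAll(Matcher.quoteReplacement("<b>" + word + "</b>"));
}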

Replacing regex with the same amount of "." as its length

See this for my current attempt: http://regexr.com?374vg
I have a regex that captures what I want it to capture, the thing is that the String().replaceAll("regex", ".") replaces everything with just one ., which is fine if it's at the end of the line, but otherwise it doesn't work.
How can I replace every character of the match with a dot, so I get the same amount of . symbols as its length?
Here's a one line solution:
str = str.replaceAll("(?<=COG-\\d{0,99})\\d", ".").replaceAll("COG-(?=\\.+)", "....");
Here's some test code:
String str = "foo bar COG-2134 baz";
str = str.replaceAll("(?<=COG-\\d{0,99})\\d", ".").replaceAll("COG-(?=\\.+)", "....");
System.out.println(str);
Output:
foo bar ........ baz
This is not possible using String#replaceAll. You might be able to use Pattern.compile(regexp) and iterate over the matches like so:
StringBuilder result = new StringBuilder();
Pattern pattern = Pattern.compile(regexp);
Matcher matcher = pattern.matcher(inputString);
int previous = 0;
while (matcher.find()) {
result.append(inputString.substring(previous, matcher.start()));
result.append(buildStringWithDots(matcher.end() - matcher.start()));
previous = matcher.end();
}
result.append(inputString.substring(previous, inputString.length()));
To use this you have to define buildStringWithDots(int length) to build a String containing length dots.
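For completeness, here is one way the missing buildStringWithDots helper and the loop above could be put together. The class name DotMasker and the method name maskMatches are made up for this sketch; the pattern "COG-\\d+" in main is borrowed from the other answers as an example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class DotMasker {
    // Returns a string consisting of `length` dots.
    static String buildStringWithDots(int length) {
        char[] dots = new char[length];
        Arrays.fill(dots, '.');
        return new String(dots);
    }

    // Replaces every match of regexp with as many dots as the match is long.
    static String maskMatches(String inputString, String regexp) {
        StringBuilder result = new StringBuilder(inputString.length());
        Matcher matcher = Pattern.compile(regexp).matcher(inputString);
        int previous = 0;
        while (matcher.find()) {
            result.append(inputString, previous, matcher.start());
            result.append(buildStringWithDots(matcher.end() - matcher.start()));
            previous = matcher.end();
        }
        result.append(inputString, previous, inputString.length());
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(maskMatches("foo bar COG-2134 baz", "COG-\\d+"));
    }
}
```

Unlike the approach that fills one array from the first match and reuses it, this handles matches of different lengths in the same string.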
Consider this code:
Pattern p = Pattern.compile("COG-([0-9]+)");
Matcher mt = p.matcher("Fixed. Added ''Show annualized values' chackbox in EF Comp Report. Also fixed the problem with the missing dots for the positions and the problem, described in COG-18613");
if (mt.find()) {
char[] array = new char[mt.group().length()];
Arrays.fill(array, '.');
System.out.println( " <=> " + mt.replaceAll(new String(array)));
}
OUTPUT:
Fixed. Added ''Show annualized values' chackbox in EF Comp Report. Also fixed the problem with the missing dots for the positions and the problem, described in .........
Personally, I'd simplify your life and just do something like this (for starters). I'll let you finish.
public class Test {
public static void main(String[] args) {
String cog = "COG-19708";
for (int i = cog.indexOf("COG-"); i < cog.length(); i++) {
System.out.println(cog.substring(i,i+1));
// build new string
}
}
}
Can you put your regex in a capturing group and then replace the match with a string of the same length as the matched group? Something like:
regex = (_what_i_want_to_match)
String().replaceAll(regex, create string that has that many '.' as length of $1)
Note: $1 is what you matched in your search.
see also: http://www.regular-expressions.info/brackets.html

Java regular expression for repeated letters

I can't find a regex that matches repeated letters. My problem is that I want to use regex to filter out spam-mails, for example, I want to use regex to detect "spam" and "viagra" in these strings :
"xxxSpAmyyy",
"xxxSPAMyyy",
"xxxvI a Gr AA yyy",
"xxxV iiA gR a xxx"
Do You have any suggestions how I do that in a good way?
This ignores case and matches the letters whether they are right next to each other or have other characters between them:
"(?i).{0,}v.{0,}i.{0,}a.{0,}g.{0,}r.{0,}a.{0,}"
If you know how many characters can appear between the letters, you can use .{0,max_distance} instead of .{0,}.
UPDATE:
It works even with duplicated letters, as I have tried it:
String str = "xxxV iiA gR a xxx";
if(str.matches("(?i).{0,}v.{0,}i.{0,}a.{0,}g.{0,}r.{0,}a.{0,}")){
System.out.println("Yes");
}
else{
System.out.println("No");
}
This prints Yes
I think you're on the wrong track. Spam filtering is closely related to machine learning. I'd suggest reading about Bayesian spam filtering.
If you expect to get spam mails with misspelled words (and other kinds of garbage), I'd suggest filtering based not on entire words but on n-grams.
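To make the n-gram idea a bit more concrete, here is a minimal sketch that extracts character n-grams from a message. The class name NGrams, the method charNGrams, and the normalization step (lower-casing and dropping everything except ASCII letters and digits) are assumptions made for this example; a real Bayesian filter would go on to count such n-grams as features, which is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class NGrams {
    // Returns all overlapping character n-grams of the normalized text.
    static List<String> charNGrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        // Assumed normalization: lower-case, keep only ASCII letters and digits.
        String normalized = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= normalized.length(); i++) {
            grams.add(normalized.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Obfuscated and plain spellings share many trigrams after normalization.
        System.out.println(charNGrams("vI a Gr AA", 3));
        System.out.println(charNGrams("viagra", 3));
    }
}
```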
Like searching this?
"v.{0,3}i.{0,3}a.{0,3}g.{0,3}r.{0,3}a"
See Pattern
Code:
This leaves space for 0 to 3 characters between characters. I did not compile the following,
but it "should work."
String[] strings = new String[] { "xxxV iiA gR a xxx" };
final Pattern spamPattern = makePattern("viagra");
for (String s : strings) {
boolean isSpam = spamPattern.matcher(s).find();
if (isSpam) {
System.out.println("Spam: " + s);
}
}
...
Pattern makePattern(String cusWord) {
cusWord = cusWord.toLowerCase();
StringBuilder sb = new StringBuilder();
sb.append("(?i)"); // Case-insensitive setting.
for (int i = 0; i < cusWord.length(); ) {
int cp = cusWord.codePointAt(i);
i += Character.charCount(cp);
if ('o' == cp) {
sb.append("[o0]");
} else if ('l' == cp) {
sb.append("[l1]");
} else {
sb.appendCodePoint(cp);
}
sb.append(".{0,3}"); // 0 - 3 occurrences of any char.
}
return Pattern.compile(sb.toString());
}
You could try using positive look-aheads
(?=.*v)(?=.*i)(?=.*a)(?=.*g)(?=.*r)(?=.*a).*
Edit:
(?=.*v.*i.*a.*g.*r.*a.*).*
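The two variants behave differently: separate lookaheads only check that each letter occurs somewhere, in any order, while the single chained expression requires the letters in sequence. The sketch below shows the difference; the class name LookaheadOrderDemo is made up, and the (?i) flag is added here so the sample strings from the question match, which the patterns above do not include.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class LookaheadOrderDemo {
    // Each lookahead is checked independently from the start of the input.
    static final Pattern ANY_ORDER =
            Pattern.compile("(?i)(?=.*v)(?=.*i)(?=.*a)(?=.*g)(?=.*r)(?=.*a).*");
    // One chained expression: the letters must appear in this order.
    static final Pattern IN_ORDER =
            Pattern.compile("(?i).*v.*i.*a.*g.*r.*a.*");

    static boolean anyOrder(String s) {
        return ANY_ORDER.matcher(s).matches();
    }

    static boolean inOrder(String s) {
        return IN_ORDER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String spam = "xxxV iiA gR a xxx";
        String harmless = "a very big"; // contains v, i, a, g, r, but not in that order
        System.out.println(anyOrder(spam) + " " + inOrder(spam));
        System.out.println(anyOrder(harmless) + " " + inOrder(harmless));
    }
}
```

The any-order form is therefore more likely to flag ordinary text that merely happens to contain the same letters.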
Did you try any regex?
Something like \w*[sSpPaAmM]+\w* should do the trick
You can test your RE on this site : http://www.regexplanet.com/advanced/java/index.html

What would be the most efficient way of performing text substitution on this collection?

Imagine you have a List<String> collection, which can contain tens of thousands of Strings.
If some of them are in the format of:
"This is ${0}, he likes ${1},${2} ... ${n}"
What would be the most efficient way ( performance-wise ) to transform a string like the one above to:
"This is %1, he likes %2,%3 ... %n"
Note that the % way starts from 1. Here's my solution:
import java.util.regex.*;
...
String str = "I am ${0}. He is ${1}";
Pattern pat = Pattern.compile("\\$\\{(\\d+)\\}");
Matcher mat = pat.matcher(str);
while(mat.find()) {
str = mat.replaceFirst("%"+(Integer.parseInt(mat.group(1))+1));
mat = pat.matcher(str);
}
System.out.println(str);
I hope it's valid Java code, I just wrote it now in a GroovyConsole.
I'm interested in more efficient solutions, since I'm thinking that applying so many regex substitutions on so many strings might be too slow. The end code will run as Java code not Groovy code, I just used Groovy for quick prototyping :)
Here's how I would do it:
import java.util.regex.*;
public class Test
{
static final Pattern PH_Pattern = Pattern.compile("\\$\\{(\\d++)\\}");
static String changePlaceholders(String orig)
{
Matcher m = PH_Pattern.matcher(orig);
if (m.find())
{
StringBuffer sb = new StringBuffer(orig.length());
do {
m.appendReplacement(sb, "");
sb.append("%").append(Integer.parseInt(m.group(1)) + 1);
} while (m.find());
m.appendTail(sb);
return sb.toString();
}
return orig;
}
public static void main (String[] args) throws Exception
{
String s = "I am ${0}. He is ${1}";
System.out.printf("before: %s%nafter: %s%n", s, changePlaceholders(s));
}
}
test it at ideone.com
appendReplacement() performs two major functions: it appends whatever text lay between the previous match and the current one; and it parses the replacement string for group references and inserts the captured text in their place. We don't need the second function, so we bypass it by feeding it an empty replacement string. Then we call StringBuffer's append() method ourselves with the generated replacement text.
This API was expected to be opened up a bit more in a later Java release, making further optimizations possible. Matcher did eventually gain appendReplacement() and appendTail() overloads that accept a StringBuilder, but only in Java 9, so on older versions a StringBuffer is still required (StringBuilder didn't exist yet when Pattern/Matcher were introduced in JDK 1.4).
But probably the most effective optimization is compiling the Pattern once and saving it in a static final variable.
You should begin your match from the last checked index of the string instead of the first index at each iterative step. As btilly alludes in a comment, your solution is O(n^2) where it should be O(n). To avoid unnecessary string copying, use a StringBuilder instead:
StringBuilder str = new StringBuilder("I am ${0}. He is ${1}");
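Pattern pat = Pattern.compile("\\$\\{(\\d+)\\}");
Matcher mat = pat.matcher(str);
int lastIdx = 0;
while (mat.find(lastIdx)) {
String replacement = "%" + (Integer.parseInt(mat.group(1)) + 1);
// Replace the whole ${n} placeholder, not just the digits in group 1.
int start = mat.start();
str.replace(start, mat.end(), replacement);
// Resume searching right after the text that was just inserted.
lastIdx = start + replacement.length();
}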
System.out.println(str);
Code is untested so there might be some off-by-one errors.
I think it would be more efficient to use appendReplacement since then you aren't making a ton of new String objects and the search doesn't resume from the beginning each time.
String str = "I am ${0}. He is ${1}";
Pattern pat = Pattern.compile("\\$\\{(\\d+)\\}");
Matcher mat = pat.matcher(str);
StringBuffer sb = new StringBuffer(str.length());
while (mat.find()) {
mat.appendReplacement(sb, "" + Integer.parseInt(mat.group(1)));
}
mat.appendTail(sb);
System.out.println(sb.toString());
Prints:
I am 0. He is 1
Try this:
String str = "I am ${0}. He is ${1}";
Pattern pat = Pattern.compile("\\$\\{(\\d+)\\}");
Matcher mat = pat.matcher(str);
StringBuffer output = new StringBuffer(str.length());
while(mat.find()) {
mat.appendReplacement(output, "%"+(Integer.parseInt(mat.group(1))+1));
}
mat.appendTail(output);
System.out.println(output);
(Copied mainly from the Javadoc, with the added transformation from the question.)
I think this is really O(n).
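If the regex itself turns out to be the bottleneck, a hand-written scan with indexOf is another option. The following is a sketch under that assumption, not a measured improvement; the class name PlaceholderRewriter and its methods are invented for the example. It returns the original string untouched when no "${" is present, and like the regex versions it would throw a NumberFormatException for an index too large for an int.

```java
// Hypothetical helper class, for illustration only.
final class PlaceholderRewriter {
    // Rewrites ${n} to %(n+1) in a single left-to-right pass.
    static String rewrite(String s) {
        int open = s.indexOf("${");
        if (open < 0) {
            return s; // fast path: nothing to replace, no allocation
        }
        StringBuilder out = new StringBuilder(s.length());
        int pos = 0; // start of the text not yet copied to out
        while (open >= 0) {
            int close = s.indexOf('}', open + 2);
            if (close < 0) {
                break; // no closing brace anywhere after this point
            }
            if (isDigits(s, open + 2, close)) {
                out.append(s, pos, open)
                   .append('%')
                   .append(Integer.parseInt(s.substring(open + 2, close)) + 1);
                pos = close + 1;
                open = s.indexOf("${", pos);
            } else {
                // Not a numeric placeholder: keep looking after this "${".
                open = s.indexOf("${", open + 2);
            }
        }
        out.append(s, pos, s.length());
        return out.toString();
    }

    // True if s[from, to) is non-empty and consists only of ASCII digits.
    private static boolean isDigits(String s, int from, int to) {
        if (from >= to) {
            return false;
        }
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("I am ${0}. He is ${1}"));
    }
}
```

Whether this actually beats a precompiled Pattern would need to be measured on the real data.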
