Here is another problem I'm trying to solve (NOTE: this is not homework, just something that popped into my head). I'm trying to improve my problem-solving skills in Java. I want to display this:
Students ID #
Carol McKane 920 11
James Eriol 154 10
Elainee Black 462 12
What I want to do is display, in the 3rd column, the number of characters in the name without counting the spaces. Please give me some tips on how to do this, or point me to the relevant Java APIs, because I'm not yet that familiar with Java's string APIs. Thanks.
It sounds like you just want something like:
public static int countNonSpaces(String text) {
int count = 0;
for (int i = 0; i < text.length(); i++) {
if (text.charAt(i) != ' ') {
count++;
}
}
return count;
}
You may want to modify this to use Character.isWhitespace instead of only checking for ' '. Also note that characters outside the Basic Multilingual Plane are stored as surrogate pairs, so this will count each of them as two characters. Whether that is a problem for you depends on your use case.
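A minimal sketch combining both suggestions, assuming that "space" should mean anything Character.isWhitespace accepts and that a supplementary character should count once. The class and method names are illustrative and not part of the original answer.

```java
class NonWhitespaceCounter {
    // Counts code points that are not whitespace, so a surrogate pair counts once.
    static int countNonWhitespace(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (!Character.isWhitespace(codePoint)) {
                count++;
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(codePoint);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNonWhitespace("Carol McKane"));
    }
}
```

For plain ASCII names this behaves like the loop above; the difference only shows up with tabs, other whitespace, or characters outside the BMP.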
Think of solving a problem and presenting the answer as two very different steps. I won't help you with the presentation in a table, but to count the number of characters in a String (without spaces) you can use this:
String name = "Carol McKane";
int numberOfCharacters = name.replaceAll("\\s", "").length();
The regular expression \\s matches any whitespace character, and replaceAll replaces every match in the name string with "", i.e. nothing.
Probably the shortest and easiest way:
String[][] students = { { "Carol McKane", "James Eriol", "Elainee Black" }, { "920", "154", "462" } };
for (int i = 0 ; i < students[0].length; i++) {
System.out.println(students[0][i] + "\t" + students[1][i] + "\t" + students[0][i].replace( " ", "" ).length() );
}
replace() returns a new string in which every occurrence of the substring " " has been removed (replaced with ""). Calling length() on that temporary, space-free string gives you the count.
The original String will remain unchanged.
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html
cheers
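For the presentation side, one possible sketch uses String.format to line the columns up. The column widths (15 and 5) and the class and method names are assumptions chosen for this example, not something from the question.

```java
class StudentTable {
    // Length of the name with plain spaces removed.
    static int lettersIn(String name) {
        return name.replace(" ", "").length();
    }

    // Builds one table row; %-15s left-aligns the name in a 15-character column.
    static String formatRow(String name, String id) {
        return String.format("%-15s %-5s %d", name, id, lettersIn(name));
    }

    public static void main(String[] args) {
        String[] names = { "Carol McKane", "James Eriol", "Elainee Black" };
        String[] ids = { "920", "154", "462" };
        System.out.println(String.format("%-15s %-5s %s", "Students", "ID", "#"));
        for (int i = 0; i < names.length; i++) {
            System.out.println(formatRow(names[i], ids[i]));
        }
    }
}
```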
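To learn more about it you should read the API documentation for String and Character.
Here are some examples of how to do it. Note that Character.isLetter counts letters only, so digits and punctuation are skipped as well as spaces: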
// variation 1
int count1 = 0;
for (char character : text.toCharArray()) {
if (Character.isLetter(character)) {
count1++;
}
}
This uses the special short form of the "for" statement (the enhanced for loop). Here's the long form for better understanding:
// variation 2
int count2 = 0;
for (int i = 0; i < text.length(); i++) {
char character = text.charAt(i);
if (Character.isLetter(character)) {
count2++;
}
}
BTW, removing whitespace via the replace method is not good coding style in my opinion, and it is not very helpful for understanding how the String class works.
I have over a gigabyte of text that I need to go through and surround punctuation with spaces (tokenizing). I have a long regular expression (1818 characters, though that's mostly lists) that defines when punctuation should not be separated. Being long and complicated makes it hard to use groups with it, though I wouldn't leave that out as an option since I could make most groups non-capturing (?:).
Question: How can I efficiently replace certain characters that don't match a particular regular expression?
I've looked into using lookaheads or similar, and I haven't quite figured it out, but it seems to be terribly inefficient anyway. It would likely be better than using placeholders though.
I can't seem to find a good "replace with a bunch of different regular expressions for both finding and replacing in one pass" function.
Should I do this line by line instead of operating on the whole text?
String completeRegex = "[^\\w](("+protectedPrefixes+")|(("+protectedNumericOnly+")\\s*\\p{N}))|"+protectedRegex;
Matcher protectedM = Pattern.compile(completeRegex).matcher(s);
ArrayList<String> protectedStrs = new ArrayList<String>();
//Take note of the protected matches.
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
}
//Replace protected matches.
String replaceStr = "<PROTECTED>";
s = protectedM.replaceAll(replaceStr);
//Now that it's safe, separate punctuation.
s = s.replaceAll("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])"," $1 ");
// These are for apostrophes. Can these be combined with either the protecting regular expression or the one above?
s = s.replaceAll("([\\p{N}\\p{L}])'(\\p{L})", "$1 '$2");
s = s.replaceAll("([^\\p{L}])'([^\\p{L}])", "$1 ' $2");
Note the two additional replacements for apostrophes. Using placeholders protects against those replacements as well, but I'm not really concerned with apostrophes or single quotes in my protecting regex anyway, so it's not a real concern.
I'm rewriting what I considered very inefficient Perl code with my own in Java, keeping track of speed, and things were going fine until I started replacing the placeholders with the original strings. With that addition it's too slow to be reasonable (I've never seen it get even close to finishing).
//Replace placeholders with original text.
String resultStr = "";
String currentStr = "";
int currentPos = 0;
int[] protectedArray = replaceStr.codePoints().toArray();
int protectedLen = protectedArray.length;
int[] strArray = s.codePoints().toArray();
int protectedCount = 0;
for (int i=0; i<strArray.length; i++) {
int pt = strArray[i];
// System.out.println("pt: "+pt+" symbol: "+String.valueOf(Character.toChars(pt)));
if (protectedArray[currentPos]==pt) {
if (currentPos == protectedLen - 1) {
resultStr += protectedStrs.get(protectedCount);
protectedCount++;
currentPos = 0;
} else {
currentPos++;
}
} else {
if (currentPos > 0) {
resultStr += replaceStr.substring(0, currentPos);
currentPos = 0;
currentStr = "";
}
resultStr += ParseUtils.getSymbol(pt);
}
}
s = resultStr;
This code may not be the most efficient way to return the protected matches. What is a better way? Or better yet, how can I replace punctuation without having to use placeholders?
I don't know exactly how big your in-between strings are, but I suspect that you can do somewhat better than using Matcher.replaceAll, speed-wise.
You're doing 3 passes across the string, each time creating a new Matcher instance, and then creating a new String; and because you're using + to concatenate the strings, you're creating a new string which is the concatenation of the in-between string and the protected group, and then another string when you concatenate this to the current result. You don't really need all of these extra instances.
Firstly, you should accumulate the resultStr in a StringBuilder, rather than via direct string concatenation. Then you can proceed something like:
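StringBuilder resultStr = new StringBuilder();
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
// Process the unprotected text between the previous match and this one.
appendInBetween(resultStr, str, currIndex, protectedM.start());
// Copy the protected match through unchanged.
resultStr.append(protectedM.group());
currIndex = protectedM.end();
}
// Process the unprotected tail after the last match.
appendInBetween(resultStr, str, currIndex, str.length());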
where appendInBetween is a method implementing the equivalent to the replacements, just in a single pass:
void appendInBetween(StringBuilder resultStr, String s, int start, int end) {
// Pass the whole input string and the bounds, rather than taking a substring.
// Allocate roughly enough space up-front.
resultStr.ensureCapacity(resultStr.length() + end - start);
for (int i = start; i < end; ++i) {
char c = s.charAt(i);
// Check if c matches "([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])".
if (!(Character.isLetter(c)
|| Character.isDigit(c)
|| Character.getType(c) == Character.NON_SPACING_MARK
// In the regex, \\- is just an escaped hyphen, so the backslash itself is not in the set.
|| "_-<>'".indexOf(c) != -1)) {
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else if (c == '\'' && i > 0 && i + 1 < s.length()) {
// We have a quote that's not at the beginning or end.
// Call these 3 characters bcd, where c is the quote.
char b = s.charAt(i - 1);
char d = s.charAt(i + 1);
if ((Character.isDigit(b) || Character.isLetter(b)) && Character.isLetter(d)) {
// If the 3 chars match "([\\p{N}\\p{L}])'(\\p{L})"
resultStr.append(' ');
resultStr.append(c);
} else if (!Character.isLetter(b) && !Character.isLetter(d)) {
// If the 3 chars match "([^\\p{L}])'([^\\p{L}])"
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else {
resultStr.append(c);
}
} else {
// Everything else, just append.
resultStr.append(c);
}
}
}
Obviously, there is a maintenance cost associated with this code - it is undeniably more verbose. But the advantage of doing it explicitly like this (aside from the fact it is just a single pass) is that you can debug the code like any other - rather than it just being the black box that regexes are.
I'd be interested to know if this works any faster for you!
At first I thought that appendReplacement wasn't what I was looking for, but indeed it was. Since it's replacing the placeholders at the end that slowed things down, all I really needed was a way to dynamically replace matches:
StringBuffer replacedBuff = new StringBuffer();
Matcher replaceM = Pattern.compile(replaceStr).matcher(s);
int index = 0;
while (replaceM.find()) {
replaceM.appendReplacement(replacedBuff, "");
replacedBuff.append(protectedStrs.get(index));
index++;
}
replaceM.appendTail(replacedBuff);
s = replacedBuff.toString();
Reference: Second answer at this question.
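A self-contained sketch of the same appendReplacement idea, for experimenting outside the full tokenizer. The class name, method name, and sample strings are made up for illustration. It uses StringBuffer as above; on Java 9 and later, Matcher also has StringBuilder overloads of appendReplacement and appendTail.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderRestorer {
    // Replaces the n-th occurrence of the placeholder with the n-th saved string.
    static String restore(String text, String placeholder, List<String> saved) {
        Matcher m = Pattern.compile(Pattern.quote(placeholder)).matcher(text);
        StringBuffer out = new StringBuffer();
        int index = 0;
        while (index < saved.size() && m.find()) {
            // An empty replacement copies only the text before the match...
            m.appendReplacement(out, "");
            // ...and the saved string is appended literally, so '$' and '\' need no escaping.
            out.append(saved.get(index++));
        }
        // Copy whatever follows the last replaced match.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> saved = Arrays.asList("U.S.", "$5");
        System.out.println(restore("in the <PROTECTED> for <PROTECTED>", "<PROTECTED>", saved));
    }
}
```

Appending the saved text directly, rather than passing it as the replacement argument, avoids having to run it through Matcher.quoteReplacement.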
Another option to consider:
During the first pass through the String, to find the protected Strings, take the start and end indices of each match, replace the punctuation for everything outside of the match, add the matched String, and then keep going. This takes away the need to write a String with placeholders, and requires only one pass through the entire String. It does, however, require many separate small replacement operations. (By the way, be sure to compile the patterns before the loop, as opposed to using String.replaceAll()). A similar alternative is to add the unprotected substrings together, and then replace them all at the same time. However, the protected strings would then have to be added to the replaced string at the end, so I doubt this would save time.
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
String substr = s.substring(currIndex,protectedM.start());
substr = p1.matcher(substr).replaceAll(" $1 ");
substr = p2.matcher(substr).replaceAll("$1 '$2");
substr = p3.matcher(substr).replaceAll("$1 ' $2");
resultStr += substr+protectedM.group();
currIndex = protectedM.end();
}
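A runnable sketch of this one-pass variant, with the patterns compiled once and the result accumulated in a StringBuilder instead of with +=. The PROTECTED pattern here is a tiny hypothetical stand-in for the real 1818-character expression, and the class and method names are invented for the example. Unlike the fragment above, it also processes the text after the last protected match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProtectedTokenizer {
    // Hypothetical stand-in for the long protecting regex.
    private static final Pattern PROTECTED = Pattern.compile("\\b(?:Mr|Dr)\\.");
    // The three replacement patterns from the question, compiled once.
    private static final Pattern P1 = Pattern.compile("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])");
    private static final Pattern P2 = Pattern.compile("([\\p{N}\\p{L}])'(\\p{L})");
    private static final Pattern P3 = Pattern.compile("([^\\p{L}])'([^\\p{L}])");

    // Applies the punctuation-separating replacements to one unprotected chunk.
    static String separate(String chunk) {
        String r = P1.matcher(chunk).replaceAll(" $1 ");
        r = P2.matcher(r).replaceAll("$1 '$2");
        return P3.matcher(r).replaceAll("$1 ' $2");
    }

    static String tokenize(String s) {
        StringBuilder result = new StringBuilder(s.length() * 2);
        Matcher m = PROTECTED.matcher(s);
        int currIndex = 0;
        while (m.find()) {
            // Unprotected text before the match gets separated; the match is copied as-is.
            result.append(separate(s.substring(currIndex, m.start())));
            result.append(m.group());
            currIndex = m.end();
        }
        // Unprotected tail after the last match.
        result.append(separate(s.substring(currIndex)));
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hi, Dr. Who!"));
    }
}
```

Note that the first pattern also matches whitespace, so runs of spaces grow; whether to collapse them afterwards depends on the rest of the pipeline.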
Speed comparison for 100,000 lines of text:
Original Perl script: 272.960579875 seconds
My first attempt: Too long to finish.
With appendReplacement(): 14.245160866 seconds
Replacing while finding protected: 68.691842962 seconds
Thank you, Java, for not letting me down.
I'm searching for a way to delete the 4th and every later occurrence of a letter (a-zA-Z) in a run of consecutive identical letters.
For example, if I have the following string:
helloooo I am veeeeeeeeery busy right nowww because I am working veeeeeery hard
I want to delete every 4th, 5th, 6th, ... character in a row. However, a 4th r occurs in the word hard, which I do NOT want to delete, because it is not the 4th r in a row; it is surrounded by other characters. The result should be:
hellooo I am veeery busy right nowww because I am working veeery hard
I have already searched for a way to do this. I found ways to replace or delete the 4th occurrence of a character, but not the 4th occurrence of a character in a row.
Thanks in advance.
The function may be written like this:
public static String transform(String input) {
if (input.isEmpty()) {
return input;
} else {
final StringBuilder sb = new StringBuilder();
char lastChar = '\0';
int duplicates = 0;
for (int i = 0; i < input.length(); i++) {
final char curChar = input.charAt(i);
if (curChar == lastChar) {
duplicates++;
if (duplicates < 3) {
sb.append(curChar);
}
} else {
sb.append(curChar);
lastChar = curChar;
duplicates = 0;
}
}
return sb.toString();
}
}
I think it's faster than regex.
In Java you can use this replacement based on back-references:
str = str.replaceAll("(([a-zA-Z])\\2\\2)\\2+", "$1");
The regex you want is ((.)\2{2})\2*. Not quite sure what that is in Java-ese, but what it does is match any single character and then 2 additional instances of that character, followed by any number of additional instances. Then replace it with the contents of the first capture group (\1) and you're good to go.
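In Java the backslashes have to be doubled inside the string literal and the first capture group is referenced as $1 in the replacement. A small sketch, with an invented class and method name; note that . matches any character, so unlike the [a-zA-Z] version this also shortens runs of digits, spaces, or punctuation.

```java
class RunLimiter {
    // Collapses any run of four or more identical characters to exactly three.
    static String limitRuns(String input) {
        return input.replaceAll("((.)\\2{2})\\2*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(limitRuns("helloooo I am veeeeeeeeery busy right nowww"));
    }
}
```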
Is there a way to check whether a string contains an entire WORD, and not just a substring?
Envision the following scenario:
public class Test {
public static void main(String[] args) {
String[] text = {"this is a", "banana"};
String search = "a";
int counter = 0;
for(int i = 0; i < text.length; i++) {
if(text[i].toLowerCase().contains(search)) {
counter++;
}
}
System.out.println("Counter was " + counter);
}
}
This evaluates to
Counter was 2
Which is not what I'm looking for, as there is only one instance of the word 'a' in the array.
The way I read it is as follows:
The if-test finds an 'a' in text[0], the 'a' corresponding to "this is [a]". However, it also finds occurrences of 'a' in "banana", and thus increments the counter.
How can I solve this to only include the WORD 'a', and not substrings containing a?
Thanks!
You could use a regex, using Pattern.quote to escape out any special characters.
String regex = ".*\\b" + Pattern.quote(search) + "\\b.*"; // \b is a word boundary
int counter = 0;
for(int i = 0; i < text.length; i++) {
if(text[i].toLowerCase().matches(regex)) {
counter++;
}
}
Note this will also find "a" in "this is a; pause" or "Looking for an a?" where a doesn't have a space after it.
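A variation on the same idea, sketched under the assumption that case should be ignored: compile the pattern once, use find() so the surrounding .* is not needed, and let Pattern.CASE_INSENSITIVE replace the toLowerCase() call. The class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class WholeWordCounter {
    // Counts how many of the texts contain the word delimited by word boundaries.
    static int countContaining(String[] texts, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        int counter = 0;
        for (String text : texts) {
            // find() looks for the pattern anywhere in the text.
            if (p.matcher(text).find()) {
                counter++;
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        String[] text = { "this is a", "banana" };
        System.out.println("Counter was " + countContaining(text, "a"));
    }
}
```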
Could try this way:
for(int i = 0; i < text.length; i++) {
String[] words = text[i].split("\\s+");
for (String word : words)
if(word.equalsIgnoreCase(search)) {
counter++;
break;
}
}
If the words are separated by a space, then you can do:
if((" "+text[i].toLowerCase()+" ").contains(" "+search+" "))
{
...
}
This adds two spaces to the original String.
eg: "this is a" becomes " this is a ".
Then it searches for the word, with the flanking spaces.
eg: It searches for " a " when search is "a"
Arrays.asList("this is a banana".split(" ")).stream().filter((s) -> s.equals("a")).count();
Of course, as others have written, you can start playing around with all kinds of pattern to match "words" out of "text".
But the thing is: depending on the underlying problem you have to solve, this might (by far) not be good enough. Meaning: are you facing the problem of finding some pattern in some string ... or is it really that you want to interpret that text in the "human language" sense? You know, when somebody writes down text, there might be subtle typos, strange characters; all kinds of stuff that makes it hard to really "find" a certain word in that text. Unless you dive into the "language processing" aspect of things.
Long story short: if your job is "locate certain patterns in strings"; then all the other answers will do. But if your requirement goes beyond that, like "some human will be using your application to 'search' huge data sets"; then you better stop now; and consider turning to full-text enabled search engines like ElasticSearch or Solr.
I've got a JSON mapping all of the unicode emojis to a colon separated string representation of them (like twitter uses). I've imported the file into an ArrayList of Pair< Character, String> and now need to scan a String message and replace any unicode emojis with their string equivalents.
My code for conversion is the following:
public static String getStringFromUnicode(Context context, String m) {
ArrayList<Pair<Character, String>> list = loadEmojis(context);
String formattedString="";
for (Pair p : list) {
formattedString = message.replaceAll(String.valueOf(p.first), ":" + p.second + ":");
}
return formattedString;
}
but I always get the unicode emoji representation when I send the message to a server.
Any help would be greatly appreciated, thanks!!
When in doubt go back to first principles.
You have a lot of stuff that is all nested together. I have found in such cases that your best approach to solving the problem is to pull it apart and look at what the different pieces are doing. This lets you take control of the problem, and place test code where needed to see what the data is doing.
My best guess is that replaceAll() is acting unpredictably; misinterpreting the emoji string as commands for its regular expression analysis.
I would suggest substituting replaceAll() with a loop of your own that does the same thing. Since we are working with Unicode I would suggest going down deep on this one. This little code sample does the same thing as replaceAll(), but because it addresses the string character by character it should work no matter what funny control codes are in the string.
String message = "This :-) is a test :-) message";
String find = ":-)";
String replace = "!";
int pos = 0;
//Replicates function of replaceAll without the regular expression analysis
pos = subPos(message,find);
while (pos != -1)
{
String tmp = message.substring(0,pos);
tmp = tmp + replace;
tmp = tmp + message.substring(pos+find.length());
message = tmp;
pos = subPos(message,find);
}
System.out.println(message);
-- Snip --
//Replicates function of indexOf
public static int subPos(String str, String sub)
{
for (int i = 0; i < str.length() - (sub.length() - 1); i++)
{
int j;
for (j = 0; j < sub.length(); j++)
{
if (str.charAt(i + j) != sub.charAt(j))
break;
}
}
if (j == sub.length())
return i;
}
return -1;
}
I hope this helps. :-)
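Two further things may be worth checking. They are observations about the posted code, not a confirmed diagnosis. First, each loop iteration in the question starts again from the original message, so only the last replacement would survive. Second, a Character holds a single char, while many emoji are surrogate pairs. The sketch below assumes the emoji are stored as String keys (a hypothetical change to the question's data structure) and uses String.replace, which does literal replacement without regex interpretation. The shortcode names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EmojiShortcodes {
    // Replaces each emoji key found in the message with ":name:".
    static String toShortcodes(String message, Map<String, String> emojiToName) {
        String result = message;
        for (Map.Entry<String, String> e : emojiToName.entrySet()) {
            // replace() is literal, and the result carries over to the next entry.
            result = result.replace(e.getKey(), ":" + e.getValue() + ":");
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("\uD83D\uDE00", "grinning"); // U+1F600, stored as a surrogate pair
        table.put("\u2764", "heart");          // U+2764, a single char
        System.out.println(toShortcodes("Hi \uD83D\uDE00 \u2764", table));
    }
}
```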
String a="(Yeahhhh) I have finally made it to the (top)";
In the String above, there are 4 occurrences of '(' and ')' altogether.
My idea for counting them is to use the String.charAt method. However, this is rather slow, as I have to perform the count at least 10000 times due to the nature of my project.
Does anyone have a better idea or suggestion than using the .charAt method?
Sorry for not explaining clearly earlier: by 10000 times I mean that there are 10000 sentences to analyze, and the String a above is just one of them.
StringUtils.countMatches(wholeString, searchedString) (from commons-lang)
searchedString may be one-char - "("
As noted in the comments, it calls charAt(..) multiple times. But what is the complexity? It's O(n), since charAt(..) is O(1), so I don't understand why you find it slow.
Sounds like homework, so I'll try to keep it at the "nudge in the right direction".
What if you removed all characters NOT the character you are looking for, and look at the length of that string?
There is a String method that will help you with this.
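For readers who want the hint spelled out, one way it could look is below. This is a sketch with an invented class and method name; it builds a temporary string, so it is not claimed to be faster than a charAt loop.

```java
class ParenCounter {
    // Removes everything except '(' and ')' and measures what is left.
    static int countParens(String s) {
        return s.replaceAll("[^()]", "").length();
    }

    public static void main(String[] args) {
        System.out.println(countParens("(Yeahhhh) I have finally made it to the (top)"));
    }
}
```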
You can use toCharArray() once and iterate over that. It might be faster.
Why do you need to do this 10000 times per String? Why don't you simply remember the result of the first time? This would save a lot more than speeding up a single counting.
You can achieve this with the following method.
It updates and returns a map whose keys are characters and whose values are the number of occurrences in the input string.
Map<Character, Integer> countMap = new HashMap<>();
public Map<Character, Integer> updateCountMap(String inStr, Map<Character, Integer> countMap)
{
char[] chars = inStr.toCharArray();
for(int i=0;i<chars.length;i++)
{
if(!countMap.containsKey(chars[i]))
{
// First occurrence of this character.
countMap.put(chars[i], 1);
}
else
{
countMap.put(chars[i], countMap.get(chars[i])+1);
}
}
return countMap;
}
What we can do is read the file line by line and call the above method for every line. Each call adds that line's occurrence counts to the map. Thus, the character array never gets too long and we achieve what we need.
Advantages:
Single iteration over the input string's characters.
Character array size never grows large.
Result map contains the occurrences of each character.
Cheers
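On Java 8 and later the same tally can be written more compactly with Map.merge. A minimal sketch, with a hypothetical class and method name, that accumulates counts across several lines:

```java
import java.util.HashMap;
import java.util.Map;

class CharTally {
    // Adds the character counts of one line to the running totals.
    static void addCounts(String line, Map<Character, Integer> counts) {
        for (int i = 0; i < line.length(); i++) {
            // Inserts 1 for a new key, otherwise adds 1 to the existing count.
            counts.merge(line.charAt(i), 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<Character, Integer> counts = new HashMap<>();
        addCounts("(Yeahhhh) I have finally", counts);
        addCounts("made it to the (top)", counts);
        System.out.println(counts.get('(') + counts.get(')'));
    }
}
```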
You could do that with Regular Expressions:
Pattern pattern = Pattern.compile("[\\(\\)]"); //Pattern says either '(' or ')'
Matcher matcher = pattern.matcher("(Yeahhhh) I have finally made it to the (top)");
int count = 0;
while (matcher.find()) { //call find until nothing is found anymore
count++;
}
System.out.println("count "+count);
The pro is that patterns are very flexible. You could also search for parenthesized words: "\\(\\w+\\)" (a '(' followed by one or more word characters, followed by ')').
The con is that it may be like using a sledgehammer to crack a nut for very simple cases.
See the Javadoc of Pattern for more details on Regular Expressions
I tested the following methods on 10M strings, counting the "," symbol.
// split a string by ","
public static int nof1(String s)
{
int n = 0;
// Caveat: split(",") drops trailing empty strings, so trailing commas are not counted.
if (s.indexOf(',') > -1)
n = s.split(",").length - 1;
return n;
} // end method nof1
// count "," using char[]
public static int nof2(String s)
{
char[] C = s.toCharArray();
int n = 0;
for (char c : C)
{
if (c == ',')
n++;
} // end for c
return n;
} // end method nof2
// replace "," and calculate difference in length
public static int nof3(String s)
{
String s2 = s.replaceAll(",", "");
return s.length() - s2.length();
} // end method nof3
// count "," using charAt
public static int nof4(String s)
{
int n = 0;
for(int i = 0; i < s.length(); i++)
{
if (',' == s.charAt(i) )
n++;
} // end for i
return n;
} // end method nof4
// count "," using Pattern
public static int nof5(String s)
{
// Pattern pattern = Pattern.compile(","); // compiled outside the method
Matcher matcher = pattern.matcher(s);
int n = 0;
while (matcher.find() )
{
n++;
}
return n;
} // end method nof5
The results:
nof1: 4538 ms
nof2: 474 ms
nof3: 4357 ms
nof4: 357 ms
nof5: 1780 ms
So, charAt is the fastest one. BTW, grep -o ',' | wc -l took 7402 ms.