Double "Pipes" in title - java

At my job today, I was made aware of a little error in our pages' titles. Our site is built using .jsp pages and for the titles of our product pages we use
In our admin (where we can set up the titles for each of the products), we would normally add in * anyone ever run into this issue before, and if so, does anyone know of a way to fix the double pipes issue I have encountered?

The problem is that replaceAll, like replaceFirst, takes a regular expression as its first argument. The "|" character is a reserved symbol in regular expressions, and you must escape it if you want to use it as a literal. You can work around this, for example like so:
String[] words = str.split(" ");
for (int i = 0; i < words.length; i++) {
if (words[i].length() > 0) {
if (!(words[i].substring(0, 1).equals("|"))) {
sb.append(words[i].replaceFirst(words[i].substring(0, 1), words[i].substring(0, 1).toUpperCase()) + " ");
} else {
sb.append(words[i] + " ");
}
}
}

Try using the html escape code for the pipe character ¦.
Your title would be:
"Monkey Thank You ¦ Monkey Thank You Cards"

I think the issue is that replaceFirst() takes a regex as its first parameter and a replacement string as its second. Because you pass in the first character as-is for the regex parameter, what happens with the vertical bar is (omitting the append to the StringBuffer) equivalent to:
String addedToBuffer = "|".replaceFirst("|", "|".toUpperCase());
What happens then is that we have a regex which matches the empty string or the empty string. An empty regex matches at the very start of any string, so that zero-length match is replaced by "|" (to upper case) while the original "|" stays in place. So "|".replaceFirst("|", "|".toUpperCase()) evaluates to "||", and the append() call is given the parameter "|| ".
You can fix your algorithm in two ways:
1. Fix the regex automatically: use literal notation between \Q and \E. The regex you pass to replaceFirst() then becomes something like "\\Q" + literal + "\\E".
2. Realise that you do not need regexes in the first place. Instead, use two append() operations: one to append the case-converted first character of the item, the other to append the rest. This looks like this:
for(String s: items) {
if(s.equals("")) {
sb.append(" ");
}
else {
sb.append(Character.toUpperCase(s.charAt(0)));
if(s.length() > 1) {
sb.append(s.substring(1));
}
sb.append(" ");
}
}
The second approach is probably much easier to follow as well.
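To make the behaviour described above concrete, here is a minimal sketch. The class name PipeTitleDemo and the helper capitalizeWords are hypothetical names chosen for illustration, and the sample title is borrowed from this thread; this is not the asker's actual code.

```java
import java.util.regex.Pattern;

final class PipeTitleDemo {

    // Capitalizes the first character of each space-separated word without
    // calling any regex-based replace method, so "|" is just another character.
    static String capitalizeWords(String title) {
        StringBuilder sb = new StringBuilder();
        for (String word : title.split(" ")) {
            if (word.isEmpty()) {
                continue; // skip empty tokens produced by repeated spaces
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Character.toUpperCase(word.charAt(0)));
            sb.append(word.substring(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An unescaped "|" is an alternation of two empty patterns. It matches
        // the empty string at index 0, so a second pipe gets inserted.
        System.out.println("|".replaceFirst("|", "|"));                // prints "||"

        // Pattern.quote wraps the text in \Q...\E so it is matched literally.
        System.out.println("|".replaceFirst(Pattern.quote("|"), "|")); // prints "|"

        // prints "Monkey Thank You | Monkey Thank You Cards"
        System.out.println(capitalizeWords("monkey thank you | monkey thank you cards"));
    }
}
```

Pattern.quote is the library shortcut for the \Q...\E notation, and it also copes with input that itself contains \E.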
PS: For some reason the StackOverflow editor is vehemently disagreeing with code blocks in lists. If someone happens to know how to fix the munged formatting... ?


Storing words from a .txt file into a String array

I was going through the answers of this question asked by someone previously and I found them to be very helpful. However, I have a question about the highlighted answer but I wasn't sure if I should ask there since it's a 6 year old thread.
My question is about this snippet of code given in the answers:
private static boolean isAWord(String token)
{
//check if the token is a word
}
How would you check that the token is a word? Would you .contains("\\s+") the string and check to see if it contains characters between them? But what about when you encounter a paragraph? I'm not sure how to go about this.
EDIT: I think I should've elaborated a bit more. Usually, you'd think a word would be something surrounded by " " but, for example, if the file contains a hyphen (which is also surrounded by a blank space), you'd want the isAWord() method to return false. How can I verify that something is actually a word and not punctuation?
Since the question wasn't entirely clear, I made two methods. The first method, consistsOfLetters, goes through the whole string and returns false if it contains any numbers or symbols. This should be enough to determine whether a token is a word (if you don't care whether the word exists in a dictionary).
public static boolean consistsOfLetters(String string) {
for(int i=0; i<string.length(); i++) {
if(string.charAt(i) == '.' && (i+1) == string.length() && string.length() != 1) break; // if last char of string is ., it is still word
if((string.toLowerCase().charAt(i) < 'a' || string.toLowerCase().charAt(i) > 'z')) return false;
} // toLowerCase is used to avoid having to compare it to A and Z
return true;
}
The second method splits the original String (for example, a sentence of potential words) on the " " character. We then go through every element and check whether it is a word. If an element is not a word, the method returns false and skips the rest. If everything is fine, it returns true.
public static boolean isThisAWord(String string) {
String[] array = string.split(" ");
for(int i = 0; i < array.length; i++) {
if(consistsOfLetters(array[i]) == false) return false;
}
return true;
}
Also, this might not work for English, since English has apostrophes in words like "don't", so a bit of further tinkering is needed.
The Scanner in Java splits a string using its WHITESPACE_PATTERN by default, so splitting a string like "He's my friend" would result in an array like ["He's", "my", "friend"].
If that is sufficient, just remove that if clause and don't use that method.
If you want to turn "He's" into "He", "is", you need a different approach.
In short: the method works as a verification check. If the given token is not supposed to be in the result, return false; otherwise return true.
return token.matches("[\\pL\\pM]+('(s|t))?");
matches requires the entire string to match.
The pattern accepts letters (\pL) and zero-width combining diacritical marks (\pM, i.e. accents).
The optional suffix handles the English apostrophe, should you consider doesn't and let's to be one term (for instance, for translation purposes).
You might also consider hyphens.
Note that there are several different single-quote and dash characters.
Path path = Paths.get("..../x.txt");
Charset charset = Charset.defaultCharset();
String content = Files.readString(path, charset);
Pattern wordPattern = Pattern.compile("[\\pL\\pM]+");
Matcher m = wordPattern.matcher(content);
while (m.find()) {
String word = m.group(); ...
}
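As a rough sketch of how isAWord() could be filled in along the lines discussed above: it assumes that apostrophes and hyphens inside a word are acceptable, while a lone hyphen or a token containing digits is not. The class name WordCheck and the exact pattern are my own assumptions, not part of the original answers.

```java
import java.util.regex.Pattern;

final class WordCheck {

    // One or more letters (\p{L}) or combining marks (\p{M}), optionally
    // followed by further letter runs, each joined by a single apostrophe
    // (ASCII ' or typographic \u2019) or a hyphen.
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{M}]+(?:['\u2019-][\\p{L}\\p{M}]+)*");

    static boolean isAWord(String token) {
        // matches() requires the whole token to fit the pattern
        return WORD.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAWord("don't"));      // prints "true"
        System.out.println(isAWord("well-known")); // prints "true"
        System.out.println(isAWord("-"));          // prints "false"
        System.out.println(isAWord("abc123"));     // prints "false"
    }
}
```

Whether trailing punctuation such as "friend." should count as a word is a separate decision; this sketch rejects it.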

Java Regex does not match

I know that this kind of question is asked very often, but
I can't figure out why this regex does not match.
I want to check whether there is an "M" at the beginning of the line.
Finally, I want the path at the end of the line.
This is why startsWith() doesn't fit my needs.
line = "M 72208 70779 koj src\\com\\company\\testproject\\TestDomainf1.java";
if (line.matches("^(M?)(.*)$")) {}
I've also tried the other way out:
Pattern p = Pattern.compile("(M?)");
Matcher m = datePatt.matcher(line);
if (m.matches()) {
System.out.println("yay!");
}
if (line.matches("(M?)(.*)")) {}
Thanks
The correct regex would be simply
line.matches("M.*")
since the matches method enforces that the whole input sequence must match. However, this is such a simple problem that I wonder if you really need a regex for it. A plain
line.startsWith("M")
or
line.length() > 0 && line.charAt(0) == 'M'
or even just
line.indexOf('M') == 0
will work for your requirement.
Performance?
If you are also interested in performance, my second and third options win in that department, whereas the first one may easily be the slowest option: it must first compile the regex, then evaluate it. indexOf has the problem that its worst case is scanning the whole string.
UPDATE
In the meantime you have completely restated your question and made it clear that the regex is what you really need. In this case the following should work:
Matcher m = Pattern.compile("M.*?(\\S+)").matcher(input);
System.out.println(m.matches()? m.group(1) : "no match");
Note, this only works if the path doesn't contain spaces. If it does, then the problem is much harder.
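For reference, a self-contained sketch of the pattern above run against the line from the question. The class name ModifiedLineParser and the method pathOfModified are hypothetical names, and wrapping the result in an Optional is my own choice.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ModifiedLineParser {

    // "M", then as little as possible of anything, then a final run of
    // non-whitespace characters, which is captured as the path.
    private static final Pattern MODIFIED = Pattern.compile("M.*?(\\S+)");

    static Optional<String> pathOfModified(String line) {
        Matcher m = MODIFIED.matcher(line);
        // matches() forces the captured run to extend to the end of the line
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "M 72208 70779 koj src\\com\\company\\testproject\\TestDomainf1.java";
        // prints "src\com\company\testproject\TestDomainf1.java"
        System.out.println(pathOfModified(line).orElse("no match"));
        // prints "no match"
        System.out.println(pathOfModified("A 72208 70779 koj other.java").orElse("no match"));
    }
}
```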
You don't need a regex for that. Just use String#startsWith(String):
if (line.startsWith("M")) {
// code here
}
OR else use String#toCharArray():
if (line.length() > 0 && line.toCharArray()[0] == 'M') {
// code here
}
EDIT: After your edited requirement to get path from input string.
You still can avoid regex and have your code like this:
String path="";
if (line.startsWith("M"))
path = line.substring(line.lastIndexOf(' ')+1);
System.out.println(path);
OUTPUT:
src\com\company\testproject\TestDomainf1.java
You can use this pattern to check whether an M character appears at the beginning of the string:
if (line.matches("M.*"))
But for something this simple, you can just use this:
if (line.length() > 0 && line.charAt(0) == 'M')
Why not do this
line.startsWith("M");
String str = "M 72208 70779 kij src/com/knapp/testproject/TestDomainf1.java";
if (str.startsWith("M")) {
// code here
}
If you need the path, you can split the string (I guess that \t is the separator) and take the last field:
String[] tabS = "M 72208 70779 kij src\\com\\knapp\\testproject\\TestDomainf1.java".split("\t");
String path = tabS[tabS.length-1];

How to split a string in Java using "%*%" as separator, including the separator in the result list of strings?

I'm looking for the simplest way of tokenizing strings such as
INPUT OUTPUT
"hello %my% world" -> "hello ", "%my%", " world"
in Java. Is it possible to accomplish this with regex? I am basically looking for a String.split() that takes as separator something of the form "%*%" but that won't ignore it, as it seems to generally do.
Thanks
No, you can't do this the way you explained it. The reason is--it's ambiguous!
You give the example:
"hello %my% world" -> "hello ", "%my%", " world"
Should the % be attached to the string before it or after it?
Should the output be
"hello ", "%my", "% world"
Or, perhaps the output should be
"hello %", "my%", " world"
In your example you don't follow either of these rules. You come up with %my% which attaches the delimiter first to the string after it appears and then to the string before it appears.
Do you see the ambiguity?
So, you first need to come up with a clear set of rules about where you want the delimiter to be attached. Once you do this, one simple (although not particularly efficient, since Strings are immutable) way of achieving what you want is to:
1. Use String.split() to split the strings in the normal way.
2. Follow your rule set to re-add the delimiter where it should be in the string.
A simpler solution would be to just split the string by %s. That way, every other subsequence would have been between %s. All you have to do afterwards is iterate over the results, toggling a flag to know if the result is a regular string or one between %s.
Pay special attention to how the split implementation handles empty subsequences. Some implementations discard empty subsequences at the beginning or end of the input, others discard all empty subsequences, and others discard none of them.
This would not result in the exact output that you want, since the %s would be gone. However you can easily add those back if there is an actual need for them (and I presume there isn't).
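A minimal sketch of this split-and-toggle idea in Java, assuming the % characters in the input are balanced. The class name PercentTokenizer and the method tokenize are hypothetical; here the % signs are added back so the result matches the output asked for in the question.

```java
import java.util.ArrayList;
import java.util.List;

final class PercentTokenizer {

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        boolean betweenPercents = false;
        // The -1 limit keeps trailing empty strings, so the flag stays in
        // step with the % characters even when the input ends with one.
        for (String part : input.split("%", -1)) {
            if (betweenPercents) {
                tokens.add("%" + part + "%"); // re-attach the delimiters
            } else if (!part.isEmpty()) {
                tokens.add(part);             // plain text between delimited parts
            }
            betweenPercents = !betweenPercents;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints "[hello , %my%,  world]"
        System.out.println(tokenize("hello %my% world"));
    }
}
```

With an odd number of % characters the last piece would be wrapped incorrectly, so real input would need a check for that case.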
Why not split on the spaces between your words? In that case you will get "hello", "%my%", "world".
If possible, use a simpler delimiter. And I'm okay with jury-rigging "%" as your delimiter, just so you can get String.split() instead of regexps. But if that's not possible...
Regexps! You can parse this using a Matcher. If you know there's one delimiter per line, you specify a pattern that eats the whole line:
String singleDelimRegexp = "(.*)(%[^%]*%)(.*)";
Pattern singleDelimPattern = Pattern.compile(singleDelimRegexp);
Matcher singleDelimMatcher = singleDelimPattern.matcher(input);
if (singleDelimMatcher.matches()) {
String before = singleDelimMatcher.group(1);
String delim = singleDelimMatcher.group(2);
String after = singleDelimMatcher.group(3);
System.out.println(before + "//" + delim + "//" + after);
}
If the input is long and you need a chain of results, you use Matcher in a loop:
String multiDelimRegexp = "%[^%]*%";
Pattern multiDelimPattern = Pattern.compile(multiDelimRegexp);
Matcher multiDelimMatcher = multiDelimPattern.matcher(input);
int lastEnd = 0;
while (multiDelimMatcher.find()) {
String data = input.substring(lastEnd, multiDelimMatcher.start());
String delim = multiDelimMatcher.group();
lastEnd = multiDelimMatcher.end();
System.out.println(data);
System.out.println(delim);
}
String lastData = input.substring(lastEnd);
System.out.println(lastData);
Add those to a data structure as you go, and you'll build the whole parsed input.
Running on input: http://ideone.com/s8FzeW
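The loop above prints its results as it goes; collecting them into a list could look like the following sketch. The class name DelimiterScanner and the method scan are hypothetical, and dropping empty in-between strings is my own choice rather than something the answer prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DelimiterScanner {

    // A percent sign, any run of non-percent characters, and a closing percent sign.
    private static final Pattern DELIM = Pattern.compile("%[^%]*%");

    static List<String> scan(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = DELIM.matcher(input);
        int lastEnd = 0;
        while (m.find()) {
            if (m.start() > lastEnd) {
                // text between the previous delimiter and this one
                parts.add(input.substring(lastEnd, m.start()));
            }
            parts.add(m.group()); // the delimiter itself, e.g. "%my%"
            lastEnd = m.end();
        }
        if (lastEnd < input.length()) {
            parts.add(input.substring(lastEnd)); // text after the last delimiter
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints "[hello , %my%,  world]"
        System.out.println(scan("hello %my% world"));
    }
}
```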

Elegant algorithm to split a string by comma or double quotes pair in Java

The question is pretty simple.
A CSV file looks like this:
1, "John", "John Joy"
If I want to get each column, I just use String[] splits = line.split(",");
What if the CSV file looks like this:
1, "John", "Joy, John"
So we have a comma inside a double quotes pair. The above split won't work any more, because I want "Joy, John" as a complete part.
So is there a elegant / simple algorithm to deal with this situation?
Edit:
Please do not consider it as a formal CSV parsing thing. I just use CSV as a use case where I need to split.
What I really want is NOT a proper CSV parser, instead, I just want an algorithm which can properly split a line by comma considering the double quotes.
It's better to use an existing library for this purpose instead of writing a custom implementation (unless you are doing this for study).
CSV has some specifics that you can miss in a custom implementation, and a library is usually well tested.
You can find some good ones here: "Can you recommend a Java library for reading (and possibly writing) CSV files?"
EDIT
I've created a method that will parse your string, but again, it may not work perfectly because I haven't tested it well.
It could serve as a starting point that you can improve further.
String inputString = "1, \"John\",\"Joy, John\"";
char quote = '"';
List<String> csvList = new ArrayList<String>();
boolean inQuote = false;
int lastStart = 0;
for (int i = 0; i < inputString.length(); i++) {
if ((i + 1) == inputString.length()) {
//if this is the last character
csvList.add(inputString.substring(lastStart, i + 1));
}
if (inputString.charAt(i) == quote) {
//if the character is quote
if (inQuote) {
inQuote = false;
continue; //escape
}
inQuote = true;
continue;
}
if (inputString.charAt(i) == ',') {
if (inQuote) continue;
csvList.add(inputString.substring(lastStart, i));
lastStart = i + 1;
}
}
System.out.println(csvList);
Question for you
What if you get a string like 1, "John", ""Joy, John""
(two quotes around "Joy, John")?
// use regexp with matcher
String string1 = "\"John\", \"John Joy\"";
String string2 = "\"John\", \"Joy, John\"";
Pattern pattern = Pattern.compile("\"[^\"]+\"");
Matcher matcher = pattern.matcher(string1);
System.out.println("string1: " + string1);
int start = 0;
while(matcher.find(start)){
System.out.println(matcher.group());
start = matcher.end() + 1;
if(start > string1.length())
break;
}
matcher = pattern.matcher(string2);
System.out.println("string2: " + string2);
start = 0;
while(matcher.find(start)){
System.out.println(matcher.group());
start = matcher.end() + 1;
if(start > string2.length())
break;
}
Using regular expressions is quite elegant.
Sorry, I'm not familiar with Java regex, so my example is in Lua:
(this example doesn't take into account that there may be newline chars inside quoted text, and that original quote chars would be doubled inside quoted text)
--- file.csv
1, "John", "John Joy"
2, "John", "Joy, John"
--- Lua code
for line in io.lines 'file.csv' do
print '==='
for _, s in (line..','):gmatch '%s*("?)(.-)%1%s*,' do
print(s)
end
end
--- Output
===
1
John
John Joy
===
2
John
Joy, John
You could start with the regular expression:
[^",]*|"[^"]*"
which matches either a non-quoted string not containing a comma or a quoted string. However, there are lots of questions, including:
Do you really have spaces after the commas in your input? Or, more generally, will you allow quotes which are not exactly at the first character of a field?
How do you put quotes around a field which includes a quote?
Depending on how you answer that question, you might end up with different regular expressions. (Indeed, the customary advice to use a CSV parsing library is not so much about handling the corner cases; it is about not having to think about them because you assume "standard CSV" handling, whatever that might be according to the author of the parsing library. CSV is a mess.)
One regular expression I've used with some success (although it is not CSV compatible) is:
(?:[^",]|"[^"]*")*
which is pretty similar to the first one, except that it allows any number of concatenated fields, so that both of the following are recognized as a single field:
"John"", Mary"
John", "Mary
CSV standard would treat the first one as representing:
John", Mary -- internal quote
and treat the quotes in the second one as ordinary characters, resulting in two fields. So YMMV.
In any event, once you decide on an appropriate regex, the algorithm is simple. In pseudo-code, since I'm far from a Java expert:
repeat:
    match the regex at the current position
    and append the result to the result;
    if the match fails:
        report error
    if the match goes to the end of the string:
        done
    if the next character is a ',':
        advance the position by one
    otherwise:
        report error
Depending on the regex, the two conditions under which you report an error might not be possible. Generally, the first one will trigger if the quoted field is not terminated (and you need to decide whether to allow new-lines in the quoted field -- CSV does). The second one might happen if you used the first regex I provided and then didn't immediately follow the quoted string with a comma.
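The pseudo-code above might translate to Java roughly as follows. This is a sketch using the second regex from this answer; the class name QuotedFieldSplitter is hypothetical, and trimming each field while keeping its surrounding quotes is an assumption of mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedFieldSplitter {

    // Any mix of ordinary characters and complete "quoted" runs.
    private static final Pattern FIELD = Pattern.compile("(?:[^\",]|\"[^\"]*\")*");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        int pos = 0;
        while (true) {
            m.region(pos, line.length());
            // lookingAt() anchors the match at the start of the region only
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("No field at index " + pos);
            }
            fields.add(m.group().trim());
            pos = m.end();
            if (pos == line.length()) {
                return fields;
            }
            // With this regex we only stop early at a comma or at a quote
            // that is never closed.
            if (line.charAt(pos) != ',') {
                throw new IllegalArgumentException("Expected ',' at index " + pos);
            }
            pos++; // step over the comma
        }
    }

    public static void main(String[] args) {
        // prints [1, "John", "Joy, John"]
        System.out.println(split("1, \"John\", \"Joy, John\""));
    }
}
```

Because this particular regex can match the empty string, the first error branch never fires here; an unterminated quote ends up in the second one.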
First split the string on quotes. Odd segments will have quoted content; even ones will have to be split one more time on commas. I use it on logs, where quoted text doesn't have escaped quotes, just like in this question.
boolean quoted = false;
for(String q : str.split("\"")) {
if(quoted)
System.out.println(q.trim());
else
for(String s : q.split(","))
if(!s.trim().isEmpty())
System.out.println(s.trim());
quoted = !quoted;
}

How do I delete specific characters from a particular String in Java?

For example I'm extracting a text String from a text file and I need those words to form an array. However, when I do all that some words end with comma (,) or a full stop (.) or even have brackets attached to them (which is all perfectly normal).
What I want to do is to get rid of those characters. I've been trying to do that using those predefined String methods in Java but I just can't get around it.
Reassign the variable to a substring:
s = s.substring(0, s.length() - 1)
Also an alternative way of solving your problem: you might also want to consider using a StringTokenizer to read the file and set the delimiters to be the characters you don't want to be part of words.
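A short sketch of the StringTokenizer idea, assuming the unwanted characters are whitespace, commas, full stops and round brackets. The class name TokenizerWords and the exact delimiter set are my own choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class TokenizerWords {

    static List<String> words(String text) {
        // Every character in the second argument acts as a delimiter, so
        // punctuation is dropped at the same time as whitespace.
        StringTokenizer tokenizer = new StringTokenizer(text, " \t\n\r,.()");
        List<String> result = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "[Hello, world, again]"
        System.out.println(words("Hello, world (again)."));
    }
}
```

Note that the Javadoc describes StringTokenizer as a legacy class and points to String.split or java.util.regex for new code.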
Use:
String str = "whatever";
str = str.replaceAll("[,.]", "");
replaceAll takes a regular expression. This:
[,.]
...looks for each comma and/or period.
To remove the last character do as Mark Byers said
s = s.substring(0, s.length() - 1);
Additionally, another way to remove the characters you don't want would be to use the .replace(oldCharacter, newCharacter) method.
as in:
s = s.replace(",","");
and
s = s.replace(".","");
You can't modify a String in Java; Strings are immutable. All you can do is create a new string that is a substring of the old string, minus the last character.
In some cases a StringBuffer might help you instead.
The best method is what Mark Byers explains:
s = s.substring(0, s.length() - 1)
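For example, if we want to replace \ with a space " " using replaceAll, the following attempts do not work as intended, because the backslash is an escape character in both Java string literals and regular expressions: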
String.replaceAll("\\", "");
or
String.replaceAll("\\$", ""); //if it is a path
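To illustrate the backslash case, here is a small sketch of two variants that do work. The class and method names are hypothetical.

```java
import java.util.regex.PatternSyntaxException;

final class BackslashDemo {

    // replace() treats its arguments as literal text, so a single
    // backslash (written "\\" in Java source) is enough.
    static String withReplace(String s) {
        return s.replace("\\", " ");
    }

    // replaceAll() takes a regex, where the backslash must be escaped once
    // more: the regex \\ is written "\\\\" in Java source.
    static String withReplaceAll(String s) {
        return s.replaceAll("\\\\", " ");
    }

    public static void main(String[] args) {
        String path = "src\\com\\company";
        System.out.println(withReplace(path));    // prints "src com company"
        System.out.println(withReplaceAll(path)); // prints "src com company"

        try {
            // The regex here is a lone backslash, which is not a valid pattern.
            path.replaceAll("\\", " ");
        } catch (PatternSyntaxException e) {
            System.out.println("invalid regex: " + e.getDescription());
        }
    }
}
```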
Note that word boundaries also depend on the Locale. I think the best way to do it is to use the standard java.text.BreakIterator. Here is an example from the java.sun.com tutorial.
import java.text.BreakIterator;
import java.util.Locale;
public static void main(String[] args) {
String text = "\n" +
"\n" +
"For example I'm extracting a text String from a text file and I need those words to form an array. However, when I do all that some words end with comma (,) or a full stop (.) or even have brackets attached to them (which is all perfectly normal).\n" +
"\n" +
"What I want to do is to get rid of those characters. I've been trying to do that using those predefined String methods in Java but I just can't get around it.\n" +
"\n" +
"Every help appreciated. Thanx";
BreakIterator wordIterator = BreakIterator.getWordInstance(Locale.getDefault());
extractWords(text, wordIterator);
}
static void extractWords(String target, BreakIterator wordIterator) {
wordIterator.setText(target);
int start = wordIterator.first();
int end = wordIterator.next();
while (end != BreakIterator.DONE) {
String word = target.substring(start, end);
if (Character.isLetterOrDigit(word.charAt(0))) {
System.out.println(word);
}
start = end;
end = wordIterator.next();
}
}
Source: http://java.sun.com/docs/books/tutorial/i18n/text/word.html
You can use the replaceAll() method:
String.replaceAll(",", "");
String.replaceAll("\\.", "");
String.replaceAll("\\(", "");
etc..
