String.split by semicolon - java

I want to split a string by semicolon(";"):
String phrase = "‫;‪14/May/2015‬‬ ‫‪FC‬‬ ‫‪Barcelona‬‬ ‫‪VS.‬‬ ‫‪Real‬‬ ‫‪Madrid";
String[] dateSplit = phrase.split(";");
System.out.println("dateSplit[0]:" + dateSplit[0]);
System.out.println("dateSplit[1]:" + dateSplit[1]);
But it removes the ";" from the string and puts the whole string into dateSplit[1],
so the output is:
dateSplit[0]:‫
dateSplit[1]:‪14/May/2015‬‬ ‫‪FC‬‬ ‫‪Barcelona‬‬ ‫‪VS.‬‬ ‫‪Real‬‬ ‫‪Madrid`
and on doing
System.out.println("Real String :"+phrase);
string printed is
Real String :‫;‪14/May/2015‬‬ ‫‪FC‬‬ ‫‪Barcelona‬‬ ‫‪VS.‬‬ ‫‪Real‬‬ ‫‪Madrid

The phrase contains bidirectional control characters such as right-to-left embedding. That is why some editors fail to display the string correctly.
This piece of code shows the actual characters in the String (for some people the phrase won't display correctly here, but it compiles and looks fine in Eclipse). I simply render left-to-right embedding as ->, right-to-left embedding as <- and pop directional formatting as ^:
public static void main(String[]args) {
String phrase = "‫;‪14/May/2015‬‬ ‫‪FC‬‬ ‫‪Barcelona‬‬ ‫‪VS.‬‬ ‫‪Real‬‬ ‫‪Madrid";
String[] dateSplit = phrase.split(";");
for (String d : dateSplit) {
System.out.println(d);
}
char[] c = phrase.toCharArray();
StringBuilder p = new StringBuilder();
for (int i = 0; i < c.length;i++) {
int code = Character.codePointAt(c, i);
switch (code) {
case 8234:
p.append(" -> ");
break;
case 8235:
p.append(" <- ");
break;
case 8236:
p.append(" ^ ");
break;
default:
p.append(c[i]);
}
}
System.out.println(p.toString());
}
Prints:
<- ; -> 14/May/2015 ^ ^ <- -> FC ^ ^ <- -> Barcelona ^ ^ <- -> VS. ^ ^ <- -> Real ^ ^ <- -> Madrid
String#split() works on the actual characters of the string, not on what the editor displays. As you can see, the ; is the second character, right after a right-to-left embedding, which gives (beware of the display again: the ; is not part of the string in dateSplit[1]):
dateSplit[0] = "";
dateSplit[1] = "14/May/2015‬‬ ‫‪FC‬‬ ‫‪Barcelona‬‬ ‫‪VS.‬‬ ‫‪Real‬‬ ‫‪Madrid";
I guess you are processing data from a language written right-to-left, mixed with football team names that are left-to-right. The solution is most likely to get rid of the directional characters and put the ; in the right place, i.e. as a separator between the tokens.
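A minimal sketch of that clean-up, assuming the only offending characters are the explicit directional formatting characters U+202A to U+202E and that the ; really sits between the date and the fixture. The class name, the helper name and the sample string are made up for illustration:

```java
class BidiStrip {
    // Removes the explicit directional formatting characters (U+202A through U+202E).
    static String stripBidi(String s) {
        return s.replaceAll("[\\u202A-\\u202E]", "");
    }

    public static void main(String[] args) {
        // Hypothetical input: embedding marks around each part, ';' used as the separator.
        String phrase = "\u202B\u202A14/May/2015\u202C\u202C;\u202B\u202AFC Barcelona VS. Real Madrid\u202C\u202C";
        String[] dateSplit = stripBidi(phrase).split(";");
        System.out.println("dateSplit[0]:" + dateSplit[0]);
        System.out.println("dateSplit[1]:" + dateSplit[1]);
    }
}
```

Stripping before splitting means the tokens no longer carry unbalanced direction marks, so they should print the same way in any editor or console.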

I retyped your code instead of copying it from here, and it works perfectly fine.
public static void main(String[] args) {
String phrase = "14/May/2015; FC Barcelona VS. Real Madrid";
String[] dateSplit = phrase.split(";");
System.out.println("dateSplit[0]:" + dateSplit[0]);
System.out.println("dateSplit[1]:" + dateSplit[1]);
}

Cutting and pasting your code into IntelliJ messed up the editor; as @Palcente said, this is possibly an encoding issue.
However, I would recommend using a StringTokenizer instead.
StringTokenizer sTok = new StringTokenizer(phrase, ";");
You can then iterate over it, which leads to nicer (and safer) code.
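As a rough sketch of that iteration (the class and method names are hypothetical, and trimming each token is an extra assumption not made in the answer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    // Collects the ';'-separated tokens of a phrase, trimming surrounding spaces.
    static List<String> splitOnSemicolon(String phrase) {
        List<String> parts = new ArrayList<>();
        StringTokenizer sTok = new StringTokenizer(phrase, ";");
        while (sTok.hasMoreTokens()) {
            parts.add(sTok.nextToken().trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String part : splitOnSemicolon("14/May/2015; FC Barcelona VS. Real Madrid")) {
            System.out.println(part);
        }
    }
}
```

Unlike String.split, StringTokenizer never returns empty tokens, so a leading ; does not produce an empty first element.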

Related

How to remove special characters from input text

I want to remove all special characters from the input text, as well as some restricted words.
The things to remove are supplied dynamically: the user decides what needs to be excluded, which is why I did not hard-code a regex.
In the real application restricted_words_list (see my code) comes from the database;
for demonstration purposes I keep it in a static String array to check whether my code works properly.
public class TestKeyword {
private static final String[] restricted_words_list={"#","of","an","^","#","<",">","(",")"};
private static final Pattern restrictedReplacer;
private static Set<String> restrictedWords = null;
static {
StringBuilder strb= new StringBuilder();
for(String str:restricted_words_list){
strb.append("\\b").append(Pattern.quote(str)).append("\\b|");
}
strb.setLength(strb.length()-1);
restrictedReplacer = Pattern.compile(strb.toString(),Pattern.CASE_INSENSITIVE);
strb = new StringBuilder();
}
public static void main(String[] args)
{
String inputText = "abcd abc# cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D";
System.out.println("inputText : " + inputText);
String modifiedText = restrictedWordCheck(inputText);
System.out.println("Modified Text : " + modifiedText);
}
public static String restrictedWordCheck(String input){
Matcher m = restrictedReplacer.matcher(input);
StringBuffer strb = new StringBuffer(input.length());//ensuring capacity
while(m.find()){
if(restrictedWords==null)restrictedWords = new HashSet<String>();
restrictedWords.add(m.group()); //m.group() returns what was matched
m.appendReplacement(strb,""); //this writes out what came in between matching words
for(int i=m.start();i<m.end();i++)
strb.append("");
}
m.appendTail(strb);
return strb.toString();
}
}
The output is :
inputText : abcd abc# cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D
Modified Text : abcd abc# cbda ssef jjj the gg wh&at gggg ss%ss ### (()) DhD
Here the words of and an are excluded, but only some of the special characters are removed, not all of those I specified in restricted_words_list.
Now I have a better solution:
String inputText = title;// assigning input
List<String> restricted_words_list = catalogueService.getWordStopper(); // getting all stopper words from database dynamically (inside getWordStopper() method just i wrote a query and getting list of words)
String finalResult = "";
List<String> stopperCleanText = new ArrayList<String>();
String[] afterTextSplit = inputText.split("\\s"); // split and add to list
for (int i = 0; i < afterTextSplit.length; i++) {
stopperCleanText.add(afterTextSplit[i]); // adding to list
}
stopperCleanText.removeAll(restricted_words_list); // remove all word stopper
for (String addToString : stopperCleanText)
{
finalResult += addToString+";"; // add semicolon to cleaned text
}
return finalResult;
public String replaceAll(String regex,
String replacement)
Replaces each substring of this string (which matches the given regular expression) with the given replacement.
Parameters:
regex - the regular expression to which this string is to be
matched
replacement - the string to be substituted for each match.
So you just need to provide replacement parameter with an empty String.
You should change your loop
for(String str:restricted_words_list){
strb.append("\\b").append(Pattern.quote(str)).append("\\b|");
}
to this:
for(String str:restricted_words_list){
strb.append("\\b*").append(Pattern.quote(str)).append("\\b*|");
}
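Because \\b only matches at a word boundary, i.e. between a word character and a non-word character, your loop matches a symbol such as # only when there is a word character directly before and after it. Since abc# has no word character after the #, it is not replaced. If you add * (which means 0 or more occurrences) to the \\b on either side, the boundary becomes optional and things like abc# are matched as well. Be aware that this also makes the boundary optional for real words, so of would then be removed from inside a longer word such as offer.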
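One possible way around the boundary issue, as a sketch rather than the asker's or answerer's actual code: wrap an entry in \b only when it starts and ends with a word character, and match symbols without any boundary. The class and method names here are made up, and the entries are assumed to be non-empty:

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

class RestrictedWords {
    // Builds one alternation; whole-word matching is applied to word-like entries only.
    static Pattern build(String[] restricted) {
        StringJoiner alternation = new StringJoiner("|");
        for (String entry : restricted) {
            String quoted = Pattern.quote(entry);
            boolean wordLike = entry.matches("\\w(.*\\w)?");
            alternation.add(wordLike ? "\\b" + quoted + "\\b" : quoted);
        }
        return Pattern.compile(alternation.toString(), Pattern.CASE_INSENSITIVE);
    }

    static String clean(String input, String[] restricted) {
        return build(restricted).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String[] restricted = {"#", "of", "an", "^", "<", ">", "(", ")"};
        System.out.println(clean("abc# of t#he g^g an (x)", restricted));
    }
}
```

Symbols are removed wherever they occur, while of and an are removed only as whole words, so offer and and stay intact. The spaces that surrounded a removed word are left in place.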
You may also consider using a regex directly to replace those special characters with an empty string. See: Java; String replace (using regular expressions)?, and the tutorial at http://www.vogella.com/articles/JavaRegularExpressions/article.html
You can also do like this :
String inputText = "abcd abc# cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D";
String regx="([^a-z^ ^0-9]*\\^*)";
String textWithoutSpecialChar=inputText.replaceAll(regx,"");
System.out.println("Without Special Char:"+textWithoutSpecialChar);
String yourSetofString="of|an"; // your restricted words.
String op=textWithoutSpecialChar.replaceAll(yourSetofString,"");
System.out.println("output : "+op);
Output:
Without Special Char:abcd abc cbda ssef of jjj the gg an what gggg ssss h
output : abcd abc cbda ssef jjj the gg what gggg ssss h
String s = "abcd abc# cbda ssef of jjj t#he g^g an wh&at ggg (blah) and | then";
String[] words = new String[]{ " of ", "|", "(", " an ", "#", "#", "&", "^", ")" };
StringBuilder sb = new StringBuilder();
for( String w : words ) {
if( w.length() == 1 ) {
sb.append( "\\" );
}
sb.append( w ).append( "|" );
}
System.out.println( s.replaceAll( sb.toString(), "" ) );

replaceFirst for character "`"

First time here. I'm trying to write a program that takes a string input from the user and encode it using the replaceFirst method. All letters and symbols with the exception of "`" (Grave accent) encode and decode properly.
e.g. When I input
`12
I am supposed to get 28AABB as my encryption, but instead, it gives me BB8AA2
import java.io.IOException;
import javax.swing.JOptionPane;
public class CryptoString {
    public static void main(String[] args) throws IOException, ArrayIndexOutOfBoundsException {
        String input = "";
        input = JOptionPane.showInputDialog(null, "Enter the string to be encrypted");
        JOptionPane.showMessageDialog(null, "The message " + input + " was encrypted to be " + encrypt(input));
    }
    public static String encrypt(String s) {
        String encryptThis = s.toLowerCase();
        String encryptThistemp = encryptThis;
        int encryptThislength = encryptThis.length();
        for (int i = 0; i < encryptThislength; ++i) {
            String test = encryptThistemp.substring(i, i + 1);
            //Took out all code with regard to all cases OTHER than "`" "1" and "2"
            //All other cases would have followed the same format, except with a different string replacement argument.
            if (test.equals("`")) {
                encryptThis = encryptThis.replaceFirst("`", "28");
            }
            else if (test.equals("1")) {
                encryptThis = encryptThis.replaceFirst("1", "AA");
            }
            else if (test.equals("2")) {
                encryptThis = encryptThis.replaceFirst("2", "BB");
            }
        }
        return encryptThis;
    }
}
I've tried putting escape characters in front of the grave accent, however, it is still not encoding it properly.
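Take a look at how your program works in each loop iteration:
i=0
encryptThis = '12 (I used ' instead of ` to make this post easier to write)
we read the character at position 0 of the original input, it is ', so
we replace the first ' with 28, making '12 -> 2812
i=1
we read the character at position 1 of the original input, it is 1, so
we replace the first 1 with AA, making 2812 -> 28AA2
i=2
we read the character at position 2 of the original input, it is 2, so
we replace the first 2 with BB, making 28AA2 -> BB8AA2
The problem is that replaceFirst searches the already-encoded string from the start, so the 2 that came from encoding the ` is found before the real 2.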
Try maybe using appendReplacement from Matcher class from java.util.regex package like
public static String encrypt(String s) {
Map<String, String> replacementMap = new HashMap<>();
replacementMap.put("`", "28");
replacementMap.put("1", "AA");
replacementMap.put("2", "BB");
Pattern p = Pattern.compile("[`12]"); //regex that will match ` or 1 or 2
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find()){//we found one of `, 1, 2
m.appendReplacement(sb, replacementMap.get(m.group()));
}
m.appendTail(sb);
return sb.toString();
}
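The same single-pass idea can also be sketched without regex by walking the input once and appending to a StringBuilder, so already-encoded output is never scanned again. This is an illustrative alternative, not the answerer's code; the class name is made up, and the three mappings are simply the ones from the question:

```java
import java.util.HashMap;
import java.util.Map;

class SinglePassEncoder {
    static String encrypt(String s) {
        Map<Character, String> codes = new HashMap<>();
        codes.put('`', "28");
        codes.put('1', "AA");
        codes.put('2', "BB");

        StringBuilder out = new StringBuilder();
        for (char ch : s.toLowerCase().toCharArray()) {
            String code = codes.get(ch);
            // Characters without a mapping are copied through unchanged.
            out.append(code != null ? code : String.valueOf(ch));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encrypt("`12")); // prints "28AABB"
    }
}
```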
encryptThistemp.substring(i, i + 1); Note that in Java the second parameter of substring is the end index (exclusive), not a length, so test is always exactly one character long no matter how large i gets. The substring call is therefore not what throws off your cases; check the other cases, which we cannot see.

Android - Editing my String so each word starts with a capital

I was wondering if someone could provide me with some code or point me towards a tutorial that explains how I can convert my string so that each word begins with a capital letter.
I would also like to convert a different string to italics.
Basically, my app gets data from several EditText boxes; on a button click the data is passed to the next page via an intent and concatenated into one paragraph. Therefore, I assume I need to edit my string on the initial page and make sure it is passed through in the same format.
Thanks in advance
You can use WordUtils from Apache Commons Lang. Its capitalize method will do the work.
For eg:
WordUtils.capitalize("i am FINE") = "I Am FINE"
or
WordUtils.capitalizeFully("i am FINE") = "I Am Fine"
Here is a simple function
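public static String capEachWord(String source) {
    StringBuilder result = new StringBuilder();
    // split on runs of whitespace so repeated spaces do not produce empty words
    for (String target : source.trim().split("\\s+")) {
        if (target.isEmpty()) {
            continue; // only happens when the input is empty or blank
        }
        result.append(Character.toUpperCase(target.charAt(0)))
              .append(target.substring(1))
              .append(' ');
    }
    return result.toString().trim();
}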
The easiest way to do this is using simple Java built-in functions.
Try something like the following (method names may not be exactly right, doing it off the top of my head):
String label = Capitalize("this is my test string");
public String Capitalize(String testString)
{
String[] brokenString = testString.split(" ");
String newString = "";
for(String s : brokenString)
{
s.charAt(0) = s.charAt(0).toUpper();
newString += s + " ";
}
return newString;
}
Give this a try, let me know if it works for you.
Just add android:inputType="textCapWords" to your EditText in the layout XML. This will make every word start with a capital letter.
Strings are immutable in Java, and String.charAt returns a value, not a reference that you can assign to (as in C++). Pheonixblade9's code will not compile. This does what Pheonixblade9 suggests, except that it compiles.
public String capitalize(String testString) {
String[] brokenString = testString.split(" ");
String newString = "";
for (String s : brokenString) {
char[] chars = s.toCharArray();
chars[0] = Character.toUpperCase(chars[0]);
newString = newString + new String(chars) + " ";
}
//the trim removes trailing whitespace
return newString.trim();
}
String source = "hello good old world";
StringBuilder res = new StringBuilder();
String[] strArr = source.split(" ");
for (String str : strArr) {
char[] stringArray = str.trim().toCharArray();
stringArray[0] = Character.toUpperCase(stringArray[0]);
str = new String(stringArray);
res.append(str).append(" ");
}
System.out.print("Result: " + res.toString().trim());

Reformatting a Java String

I have a string that looks like this:
CALDARI_STARSHIP_ENGINEERING
and I need to edit it to look like
Caldari Starship Engineering
Unfortunately it's three in the morning and I cannot for the life of me figure this out. I've always had trouble with replacing stuff in strings so any help would be awesome and would help me understand how to do this in the future.
Something like this is simple enough:
String text = "CALDARI_STARSHIP_ENGINEERING";
text = text.replace("_", " ");
StringBuilder out = new StringBuilder();
for (String s : text.split("\\b")) {
if (!s.isEmpty()) {
out.append(s.substring(0, 1) + s.substring(1).toLowerCase());
}
}
System.out.println("[" + out.toString() + "]");
// prints "[Caldari Starship Engineering]"
This splits on the word boundary anchor.
See also
regular-expressions.info/Word boundary
Matcher loop solution
If you don't mind using StringBuffer, you can also use Matcher.appendReplacement/Tail loop like this:
String text = "CALDARI_STARSHIP_ENGINEERING";
text = text.replace("_", " ");
Matcher m = Pattern.compile("(?<=\\b\\w)\\w+").matcher(text);
StringBuffer sb = new StringBuffer();
while (m.find()) {
m.appendReplacement(sb, m.group().toLowerCase());
}
m.appendTail(sb);
System.out.println("[" + sb.toString() + "]");
// prints "[Caldari Starship Engineering]"
The regex uses assertion to match the "tail" part of a word, the portion that needs to be lowercased. It looks behind (?<=...) to see that there's a word boundary \b followed by a word character \w. Any remaining \w+ would then need to be matched so it can be lowercased.
Related questions
Use Java and RegEx to convert casing in a string
Java regex does not support Perl preprocessing operations \l \u, \L, and \U.
Java split is eating my characters.
More examples of using assertions
StringBuilder and StringBuffer in Java
Unfortunately, appendReplacement/Tail only takes StringBuffer
You can try this:
String originalString = "CALDARI_STARSHIP_ENGINEERING";
String newString =
WordUtils.capitalize(originalString.replace('_', ' ').toLowerCase());
WordUtils are part of the Commons Lang libraries (http://commons.apache.org/lang/)
Using regular expressions:
String s = "CALDARI_STARSHIP_ENGINEERING";
StringBuilder camel = new StringBuilder();
Matcher m = Pattern.compile("([^_])([^_]*)").matcher(s);
while (m.find()) {
    if (camel.length() > 0) {
        camel.append(' '); // put a space where the underscore was
    }
    camel.append(m.group(1)).append(m.group(2).toLowerCase());
}
Untested, but that's how I implemented the same thing some time ago:
String s = "CALDARI_STARSHIP_ENGINEERING";
StringBuilder b = new StringBuilder();
boolean upper = true; // true when the next character starts a word
for (char c : s.toCharArray()) {
    if (upper) {
        b.append(c);
        upper = false;
    } else if (c == '_') {
        b.append(" ");
        upper = true;
    } else {
        b.append(Character.toLowerCase(c));
    }
}
s = b.toString();
Please note that the EVE license agreements might forbid writing external tools that help you in your career. And it might be the trigger for you to learn Python, because most of EVE is written in Python :).
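Quick and dirty way:
Lower case everything (strings are immutable, so keep the result):
line = line.toLowerCase();
Split into words:
String[] words = line.split("_");
Then loop through words, capitalising the first letter of each:
words[i] = words[i].substring(0, 1).toUpperCase() + words[i].substring(1);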
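Put together, those steps might look like the sketch below. The class and method names are invented for the example; joining the words with a single space and using Locale.ROOT for the case conversion are assumptions added here:

```java
import java.util.Locale;

class TitleCase {
    // Lower-cases the input, splits on '_', capitalises each word, joins with spaces.
    static String toTitle(String line) {
        StringBuilder out = new StringBuilder();
        for (String word : line.toLowerCase(Locale.ROOT).split("_")) {
            if (word.isEmpty()) {
                continue; // skip empty pieces caused by doubled or leading underscores
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(Character.toUpperCase(word.charAt(0))).append(word.substring(1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toTitle("CALDARI_STARSHIP_ENGINEERING")); // prints "Caldari Starship Engineering"
    }
}
```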

Tokenize a string with a space in java

I want to tokenize a string like this
String line = "a=b c='123 456' d=777 e='uij yyy'";
I cannot simply split on spaces like this
String [] words = line.split(" ");
Any idea how can I split so that I get tokens like
a=b
c='123 456'
d=777
e='uij yyy';
The simplest way to do this is by hand implementing a simple finite state machine. In other words, process the string a character at a time:
When you hit a space, break off a token;
When you hit a quote keep getting characters until you hit another quote.
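The two rules above can be sketched as a tiny state machine with a single "inside quotes" flag. This is only an illustration under the assumptions that quotes are single quotes, are never nested or escaped, and that tokens are separated by plain spaces; the class name is made up:

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareTokenizer {
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false; // the only state: are we between two quotes?

        for (char ch : line.toCharArray()) {
            if (ch == '\'') {
                inQuote = !inQuote;       // entering or leaving a quoted value
                current.append(ch);
            } else if (ch == ' ' && !inQuote) {
                if (current.length() > 0) { // a space outside quotes ends the token
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(ch);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // last token has no trailing space
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("a=b c='123 456' d=777 e='uij yyy'")) {
            System.out.println(token);
        }
    }
}
```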
Depending on the formatting of your original string, you should be able to pass a regular expression as the parameter to the Java split method.
The example I linked doesn't use the regular expression that you would need for this task, though.
You can also use this SO thread as a guideline (although it's in PHP); it does something very close to what you need. Adapting it slightly might do the trick (although whether the quotes should be part of the output may cause some issues). Keep in mind that regex syntax is very similar in most languages.
Edit: going much further into this type of task may be beyond the capabilities of regex, so you may need to write a simple parser.
line.split(" (?=[a-z+]=)")
correctly gives:
a=b
c='123 456'
d=777
e='uij yyy'
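Make sure you adapt the [a-z+] part in case your key structure changes. As written it matches a single character (a lowercase letter or a literal +); for longer lowercase keys use [a-z]+ instead.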
Edit: this solution can fail miserably if there is a "=" character in the value part of the pair.
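A runnable sketch of the lookahead split, assuming keys consist of one or more lowercase letters (hence [a-z]+ here rather than the single-character class above). The class and method names are made up:

```java
class LookaheadSplit {
    // Splits at each space that is immediately followed by "key=".
    static String[] tokenize(String line) {
        return line.split(" (?=[a-z]+=)");
    }

    public static void main(String[] args) {
        for (String token : tokenize("a=b c='123 456' d=777 e='uij yyy'")) {
            System.out.println(token);
        }
    }
}
```

Because the lookahead consumes nothing, only the space itself is removed and each key stays attached to its value. The same caveat applies: a value that contains a space followed by lowercase letters and = would be split in the wrong place.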
StreamTokenizer can help, although it is easiest to set up to break on '=', as it will always break at the start of a quoted string:
String s = "Ta=b c='123 456' d=777 e='uij yyy'";
StreamTokenizer st = new StreamTokenizer(new StringReader(s));
st.ordinaryChars('0', '9');
st.wordChars('0', '9');
while (st.nextToken() != StreamTokenizer.TT_EOF) {
switch (st.ttype) {
case StreamTokenizer.TT_NUMBER:
System.out.println(st.nval);
break;
case StreamTokenizer.TT_WORD:
System.out.println(st.sval);
break;
case '=':
System.out.println("=");
break;
default:
System.out.println(st.sval);
}
}
outputs
Ta
=
b
c
=
123 456
d
=
777
e
=
uij yyy
If you leave out the two lines that convert numeric characters to alpha, then you get d=777.0, which might be useful to you.
Assumptions:
Your variable name ('a' in the assignment 'a=b') can be of length 1 or more
Your variable name ('a' in the assignment 'a=b') can not contain the space character, anything else is fine.
Validation of your input is not required (input assumed to be in valid a=b format)
This works fine for me.
Input:
a=b abc='123 456' &=777 #='uij yyy' ABC='slk slk' 123sdkljhSDFjflsakd#*#&=456sldSLKD)#(
Output:
a=b
abc='123 456'
&=777
#='uij yyy'
ABC='slk slk'
123sdkljhSDFjflsakd#*#&=456sldSLKD)#(
Code:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
// SPACE CHARACTER followed by
// sequence of non-space characters of 1 or more followed by
// first occurring EQUALS CHARACTER
final static String regex = " [^ ]+?=";
// static pattern defined outside so that you don't have to compile it
// for each method call
static final Pattern p = Pattern.compile(regex);
public static List<String> tokenize(String input, Pattern p){
input = input.trim(); // this is important for "last token case"
// see end of method
Matcher m = p.matcher(input);
ArrayList<String> tokens = new ArrayList<String>();
int beginIndex=0;
while(m.find()){
int endIndex = m.start();
tokens.add(input.substring(beginIndex, endIndex));
beginIndex = endIndex+1;
}
// LAST TOKEN CASE
//add last token
tokens.add(input.substring(beginIndex));
return tokens;
}
private static void println(List<String> tokens) {
for(String token:tokens){
System.out.println(token);
}
}
public static void main(String args[]){
String test = "a=b " +
"abc='123 456' " +
"&=777 " +
"#='uij yyy' " +
"ABC='slk slk' " +
"123sdkljhSDFjflsakd#*#&=456sldSLKD)#(";
List<String> tokens = RegexTest.tokenize(test, p);
println(tokens);
}
}
Or, with a regex for tokenizing, and a little state machine that just adds the key/val to a map:
String line = "a = b c='123 456' d=777 e = 'uij yyy'";
Map<String,String> keyval = new HashMap<String,String>();
String state = "key";
Matcher m = Pattern.compile("(=|'[^']*?'|[^\\s=]+)").matcher(line);
String key = null;
while (m.find()) {
String found = m.group();
if (state.equals("key")) {
if (found.equals("=") || found.startsWith("'"))
{ System.err.println ("ERROR"); }
else { key = found; state = "equals"; }
} else if (state.equals("equals")) {
if (! found.equals("=")) { System.err.println ("ERROR"); }
else { state = "value"; }
} else if (state.equals("value")) {
if (key == null) { System.err.println ("ERROR"); }
else {
if (found.startsWith("'"))
found = found.substring(1,found.length()-1);
keyval.put (key, found);
key = null;
state = "key";
}
}
}
if (! state.equals("key")) { System.err.println ("ERROR"); }
System.out.println ("map: " + keyval);
prints out
map: {d=777, e=uij yyy, c=123 456, a=b}
It does some basic error checking, and takes the quotes off the values.
This solution is both general and compact (it is effectively the regex version of cletus' answer):
String line = "a=b c='123 456' d=777 e='uij yyy'";
Matcher m = Pattern.compile("('[^']*?'|\\S)+").matcher(line);
while (m.find()) {
System.out.println(m.group()); // or whatever you want to do
}
In other words, find all runs of characters that are combinations of quoted strings or non-space characters; nested quotes are not supported (there is no escape character).
public static void main(String[] args) {
String token;
String value="";
HashMap<String, String> attributes = new HashMap<String, String>();
String line = "a=b c='123 456' d=777 e='uij yyy'";
StringTokenizer tokenizer = new StringTokenizer(line," ");
while(tokenizer.hasMoreTokens()){
token = tokenizer.nextToken();
value = token.contains("'") ? value + " " + token : token ;
if(!value.contains("'") || value.endsWith("'")) {
//Split the strings and get variables into hashmap
attributes.put(value.split("=")[0].trim(),value.split("=")[1]);
value ="";
}
}
System.out.println(attributes);
}
output:
{d=777, a=b, e='uij yyy', c='123 456'}
Note that in this case consecutive spaces inside a quoted value are collapsed to a single space.
The attributes HashMap holds the resulting key/value pairs.
import java.io.*;
import java.util.Scanner;
public class ScanXan {
public static void main(String[] args) throws IOException {
Scanner s = null;
try {
s = new Scanner(new BufferedReader(new FileReader("<file name>")));
while (s.hasNext()) {
System.out.println(s.next());
// write the token to the output file here
}
} finally {
if (s != null) {
s.close();
}
}
}
}
java.util.StringTokenizer tokenizer = new java.util.StringTokenizer(line, " ");
while (tokenizer.hasMoreTokens()) {
String token = tokenizer.nextToken();
int index = token.indexOf('=');
String key = token.substring(0, index);
String value = token.substring(index + 1);
}
Have you tried splitting by '=' and creating a token out of each pair of the resulting array?
