Why is my Java split method causing an empty line to print? - java

I am working on a short assignment from school and for some reason, two delimiters right next to each other are causing an empty line to print. I would understand this if there were a space between them, but there isn't. Am I missing something simple? I would like each token to print one line at a time without printing the ~.
public class SplitExample
{
public static void main(String[] args)
{
String asuURL = "www.public.asu.edu/~JohnSmith/CSE205";
String[] words = new String[6];
words = asuURL.split("[./~]");
for(int i = 0; i < words.length; i++)
{
System.out.println(words[i]);
}
}
}//end SplitExample
Edit: Desired output below
www
public
asu
edu
JohnSmith
CSE205

I would understand this if there was a space between them but there
isn't
Sure, there is no space between them, but there is an empty string between them. In fact, there is an empty string between every pair of adjacent characters.
That is why, when two delimiters are next to each other, split returns the empty string that sits between them.
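To make that concrete, here is a minimal sketch on a shortened input. The class and method names (EmptyTokenDemo, splitOnEach) are made up for illustration; only the pattern comes from the question.

```java
import java.util.Arrays;

class EmptyTokenDemo {
    // Splits on every single '.', '/' or '~', exactly like the pattern in the question.
    static String[] splitOnEach(String input) {
        return input.split("[./~]");
    }

    public static void main(String[] args) {
        // '/' and '~' are adjacent, so the token between them is the empty string.
        System.out.println(Arrays.toString(splitOnEach("edu/~JohnSmith")));
        // prints [edu, , JohnSmith]
    }
}
```

The middle element is what shows up as the blank line when the array is printed one element per line.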
I would like each token to print one line at a time without printing
the ~.
Then why have you included / in your delimiters? [./~] means match ., or /, or ~. Splitting on them will also not print the /.
If you just want to split on ~ and ., then use only those in the character class: [.~]. But again, it's not really clear what output you want. Maybe you can post the expected output for your current input.
It seems you want to split on ., / and /~. In that case a character class alone won't do. You can use this pattern to split:
String[] words = asuURL.split("[.]|/~?");
This will split on ., /~ or / (since the ~ is optional).
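A small sketch that runs this pattern against the URL from the question. The second method, which treats a run of delimiters as one separator with [./~]+, is an alternative I am assuming would also fit; it is not from the thread, and the names here are hypothetical.

```java
import java.util.Arrays;

class UrlSplitDemo {
    // Pattern from the answer: a dot, or a slash optionally followed by '~'.
    static String[] splitAlternation(String url) {
        return url.split("[.]|/~?");
    }

    // Assumed alternative: one or more delimiter characters count as a single separator.
    static String[] splitRuns(String url) {
        return url.split("[./~]+");
    }

    public static void main(String[] args) {
        String asuURL = "www.public.asu.edu/~JohnSmith/CSE205";
        System.out.println(Arrays.toString(splitAlternation(asuURL)));
        System.out.println(Arrays.toString(splitRuns(asuURL)));
        // both print [www, public, asu, edu, JohnSmith, CSE205]
    }
}
```

The difference between the two only shows on other inputs: the first splits "a.~b" into three tokens (with an empty one in the middle), the second into two.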

What do you think the split produces? What's between / and ~ ? Yes, that's right, there is an empty string ("").

There are no characters between /~ so when you split on these characters you should expect to see a blank string.
I suspect you don't need to split on ~ and in fact I wouldn't split on . either.

Related

How to write a regex to split a String in this format?

I want to use [,.!?;~] to split a string, but I want each [,.!?;~] to stay in its place. For example:
This is the example, but it is not enough
To
[This is the example,, but it is not enough] // length=2
[0]=This is the example,
[1]=but it is not enough
As you can see the comma is still in its place. I did this with this regex (?<=([,.!?;~])+). But I want if some special word (e.g: but) comes after the [,.!?;~], then do not split that part of string. For example:
I want this sentence to be split into this form, but how to do. So if
anyone can help, that will be great
To
[0]=I want this sentence to be split into this form, but how to do.
[1]=So if anyone can help,
[2]=that will be great
As you can see, this part (form, but) is not split in the first sentence.
I've used:
Positive Lookbehind (?<=a)b to keep the delimiter.
Negative Lookahead a(?!b) to rule out stop words.
Notice how I've appended the regex (?!\\s*(but|and|if)) to the regex you provided. Put all the stop words you want to rule out (e.g. but, and, if) inside the parentheses, separated by the pipe symbol.
Also notice that the delimiter is still in its place.
Output
Count of tokens = 3
I want this sentence to be split into this form, but how to do.
So if anyone can help,
that will be great
Code
public class HelloWorld {
    public static void main(String[] args) {
        String str = "I want this sentence to be split into this form, but how to do. So if anyone can help, that will be great";
        // Split after a punctuation mark (the lookbehind keeps it in the token),
        // unless a stop word follows (the negative lookahead).
        String[] tokensVal = str.split("(?<=([,.!?;~])+)(?!\\s*(but|and|if))");
        // Print the number of tokens, then each token on its own line.
        System.out.println("Count of tokens = " + tokensVal.length);
        for (String token : tokensVal) {
            System.out.println(token);
        }
    }
}
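A possible variant, offered as a sketch rather than as the answerer's method: it uses a single-character lookbehind (one preceding punctuation mark is enough to mark a split point), consumes the whitespace after the delimiter so the tokens carry no leading space, and adds \b so that a word such as "iffy" is not mistaken for the stop word "if". The class and method names are hypothetical.

```java
import java.util.Arrays;

class StopWordSplitDemo {
    // (?<=[,.!?;~])            the split point must directly follow a punctuation mark
    // \\s*                     whitespace after the mark is consumed by the split
    // (?!\\s*(?:but|and|if)\\b) no split when a whole stop word comes next
    static String[] split(String text) {
        return text.split("(?<=[,.!?;~])\\s*(?!\\s*(?:but|and|if)\\b)");
    }

    public static void main(String[] args) {
        String str = "I want this sentence to be split into this form, but how to do. "
                + "So if anyone can help, that will be great";
        for (String token : split(str)) {
            System.out.println("[" + token + "]");
        }
        // [I want this sentence to be split into this form, but how to do.]
        // [So if anyone can help,]
        // [that will be great]
    }
}
```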

Java Regex Metacharacters returning extra space while spliting

I want to split a string using a regex instead of StringTokenizer. I am using String.split(regex);
The regex contains metacharacters, and when I use \[ the returned array contains an extra blank entry.
import java.util.Scanner;
public class Solution{
public static void main(String[] args) {
Scanner i= new Scanner(System.in);
String s= i.nextLine();
String[] st=s.split("[!\\[,?\\._'#\\+\\]\\s\\\\]+");
System.out.println(st.length);
for(String z:st)
System.out.println(z);
}
}
When I enter the input [a\m]
it returns an array of length 3 and prints
a m
There is also a blank entry before a.
Can anyone please explain why this is happening and how I can correct it? I don't want the extra blank entry in the resulting array.
Since the [ is at the beginning of the string, splitting on it leaves two elements after the first split step: the empty string that precedes the [, and the rest of the string. String#split only drops trailing empty elements (it runs with limit=0 by default), so the leading empty element is kept.
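A short sketch of that asymmetry, using the input from the question. The class name and helper methods are invented for the demo; the second call passes limit -1 only to show the trailing empty string that the default limit removes.

```java
import java.util.Arrays;

class LeadingEmptyDemo {
    // Same delimiter set as in the question, written without the redundant escapes.
    static final String DELIMS = "[!\\[,?._'#+\\]\\s\\\\]+";

    // Default behaviour (limit 0): trailing empty strings are removed, leading ones are not.
    static String[] splitDefault(String s) {
        return s.split(DELIMS);
    }

    // A negative limit keeps trailing empty strings as well.
    static String[] splitKeepTrailing(String s) {
        return s.split(DELIMS, -1);
    }

    public static void main(String[] args) {
        String input = "[a\\m]"; // the five characters [a\m]
        System.out.println(Arrays.toString(splitDefault(input)));      // prints [, a, m]
        System.out.println(Arrays.toString(splitKeepTrailing(input))); // prints [, a, m, ]
    }
}
```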
Remove the characters you split on from the start of the string first, using .replaceAll("^[!\\[,?._'#+\\]\\s\\\\]+", "") (note the ^ at the beginning of the pattern). Here is some sample code you can use:
String[] st="[a\\m]".replaceAll("^[!\\[,?._'#+\\]\\s\\\\]+", "")
.split("[!\\[,?._'#+\\]\\s\\\\]+");
System.out.println(st.length);
for(String z:st) {
System.out.println(z);
}
As an addition to Wiktor Stribiżew’s answer, you may do the same without having to specify the pattern twice, by dealing with the java.util.regex package directly. Removing this redundancy may avoid potential errors and may also be more efficient as the pattern doesn’t need to be parsed twice:
Pattern p = Pattern.compile("[!\\[,?\\._'#\\+\\]\\s\\\\]+");
Matcher m = p.matcher(s);
if(m.lookingAt()) s=m.replaceFirst("");
String[] st = p.split(s);
for(String z:st)
System.out.println(z);
To be able to use the same pattern, i.e. without having to use the anchor ^ for removing a leading separator, we first check via lookingAt() whether the pattern really matches at the beginning of the text before removing the first occurrence. Then, we proceed with the split operation, but reusing the already prepared Pattern.
Regarding your issue mentioned in a comment, the split operation will always return at least one element, the input string, when there is no match, even when the string is empty. If you wish to have an empty array then, the only solution is to replace the result explicitly:
if(st.length==1 && s.equals(st[0])) st=new String[0];
or, if you only want to treat an empty string specially, you may check this beforehand:
if(s.isEmpty()) st=new String[0];
else {
// the code as shown above
}
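Another option, which is a different technique from the two answers above, is to split first and then filter out empty strings. That handles the leading separator and the empty input in one place. This is only a sketch; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class NonEmptyTokens {
    // Compiled once and reused for every call.
    private static final Pattern DELIMS = Pattern.compile("[!\\[,?._'#+\\]\\s\\\\]+");

    // Splits on the delimiters, then drops any empty strings from the result.
    static String[] tokens(String s) {
        return Arrays.stream(DELIMS.split(s))
                .filter(t -> !t.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("[a\\m]"))); // prints [a, m]
        System.out.println(tokens("").length);                 // prints 0
    }
}
```

The trade-off is an extra pass over the result, which should not matter for input read line by line from a Scanner.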

split a string in java into equal length substrings while maintaining word boundaries

How to split a string into equal parts of maximum character length while maintaining word boundaries?
Say, for example, if I want to split a string "hello world" into equal substrings of maximum 7 characters it should return me
"hello "
and
"world"
But my current implementation returns
"hello w"
and
"orld "
I am using the following code taken from Split string to equal length substrings in Java to split the input string into equal parts
public static List<String> splitEqually(String text, int size) {
// Give the list the right capacity to start with. You could use an array
// instead if you wanted.
List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);
for (int start = 0; start < text.length(); start += size) {
ret.add(text.substring(start, Math.min(text.length(), start + size)));
}
return ret;
}
Will it be possible to maintain word boundaries while splitting the string into substring?
To be more specific, I need the splitting algorithm to break at the word boundaries given by spaces rather than rely solely on character count. The length still needs to be taken into account, but as a maximum number of characters per part rather than a hardcoded length.
If I understand your problem correctly then this code should do what you need (but it assumes that maxLenght is equal to or greater than the length of the longest word)
String data = "Hello there, my name is not importnant right now."
+ " I am just simple sentecne used to test few things.";
int maxLenght = 10;
Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find())
System.out.println(m.group(1));
Output:
Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.
Short (or not) explanation of the "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:
(Remember that in Java \ is special not only in regex but also in String literals, so to use a predefined character class like \d we have to write it as "\\d", escaping the \ for the string literal as well.)
\G - an anchor representing the end of the previous match or, if there is no match yet (when we have just started searching), the beginning of the string (same as ^)
\s* - zero or more whitespace characters (\s is whitespace, * is the "zero-or-more" quantifier)
(.{1,"+maxLenght+"}) - let's split this into parts (at runtime maxLenght holds some numeric value like 10, so the regex engine sees .{1,10}):
  . - any character (by default any character except line separators like \n or \r, but with the Pattern.DOTALL flag it matches those too; you may drop this argument if you want each line of the text to be split separately, since its start will be printed on a new line anyway)
  {1,10} - a quantifier that lets the preceding element appear 1 to 10 times (by default it tries to match as many repetitions as possible)
  .{1,10} - so, based on the above, this simply means "1 to 10 of any character"
  ( ) - parentheses create groups, structures that hold specific parts of a match (here the group comes after \\s* because we only want the part after the whitespace)
(?=\\s|$) - a look-ahead that makes sure the text matched by .{1,10} is followed by:
  whitespace (\\s)
  OR (written as |)
  the end of the string ($)
So thanks to .{1,10} we can match up to 10 characters. But with (?=\\s|$) after it we require that the last character matched by .{1,10} is not part of an unfinished word (there must be whitespace or the end of the string after it).
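The same regex can be wrapped in a method that returns the chunks as a list, which may be closer to the splitEqually signature from the question. This is a sketch under the same assumption as above (no word is longer than the limit); WordWrapDemo and wrap are names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordWrapDemo {
    // Returns chunks of at most maxLength characters that end at a word boundary.
    // Assumes every word fits within maxLength; matching stops at the first word that does not.
    static List<String> wrap(String text, int maxLength) {
        Pattern p = Pattern.compile("\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> chunks = new ArrayList<>();
        while (m.find()) {
            chunks.add(m.group(1)); // group 1 excludes the skipped leading whitespace
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(wrap("hello world", 7)); // prints [hello, world]
    }
}
```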
Non-regex solution, just in case someone is more comfortable (?) not using regular expressions:
private String justify(String s, int limit) {
StringBuilder justifiedText = new StringBuilder();
StringBuilder justifiedLine = new StringBuilder();
String[] words = s.split(" ");
for (int i = 0; i < words.length; i++) {
justifiedLine.append(words[i]).append(" ");
if (i+1 == words.length || justifiedLine.length() + words[i+1].length() > limit) {
justifiedLine.deleteCharAt(justifiedLine.length() - 1);
justifiedText.append(justifiedLine.toString()).append(System.lineSeparator());
justifiedLine = new StringBuilder();
}
}
return justifiedText.toString();
}
Test:
String text = "Long sentence with spaces, and punctuation too. And supercalifragilisticexpialidocious words. No carriage returns, tho -- since it would seem weird to count the words in a new line as part of the previous paragraph's length.";
System.out.println(justify(text, 15));
Output:
Long sentence
with spaces,
and punctuation
too. And
supercalifragilisticexpialidocious
words. No
carriage
returns, tho --
since it would
seem weird to
count the words
in a new line
as part of the
previous
paragraph's
length.
It takes into account words that are longer than the set limit, so it doesn't skip them (unlike the regex version, which just stops processing when it finds supercalifragilisticexpialidocious).
PS: The comment about all input words being expected to be shorter than the set limit, was made after I came up with this solution ;)

Remove Special Characters For A Pattern Java

I want to remove these characters from a String:
+ - ! ( ) { } [ ] ^ ~ : \
I also want to remove these:
/*
*/
&&
||
I mean that I will not remove a single & or |; I will remove them only when the second character follows the first one (/* */ && ||)
How can I do that efficiently in Java?
Example:
a:b+c1|x||c*(?)
will be:
abc1|xc*?
This can be done via a long, but actually very simple regex.
String aString = "a:b+c1|x||c*(?)";
String sanitizedString = aString.replaceAll("[+\\-!(){}\\[\\]^~:\\\\]|/\\*|\\*/|&&|\\|\\|", "");
System.out.println(sanitizedString);
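Since the question asks about speed: String.replaceAll compiles the regex on every call, so if the method runs often it may help to compile the pattern once and reuse it. A minimal sketch, with Sanitizer and sanitize as made-up names:

```java
import java.util.regex.Pattern;

class Sanitizer {
    // Single special characters in one character class, then the two-character tokens.
    private static final Pattern SPECIAL =
            Pattern.compile("[+\\-!(){}\\[\\]^~:\\\\]|/\\*|\\*/|&&|\\|\\|");

    // Removes every match of the precompiled pattern from the input.
    static String sanitize(String input) {
        return SPECIAL.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("a:b+c1|x||c*(?)")); // prints abc1|xc*?
    }
}
```

Whether this is measurably faster depends on how often it is called; for a one-off replacement the plain replaceAll above is just as good.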
I think that the java.lang.String.replaceAll(String regex, String replacement) is all you need:
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replaceAll(java.lang.String, java.lang.String).
There are two ways to do that:
1)
// needs: import java.util.ArrayList;
ArrayList<String> arrayList = new ArrayList<String>();
// Two-character tokens from the question.
arrayList.add("/*");
arrayList.add("*/");
arrayList.add("||");
arrayList.add("&&");
// Single characters from the question. A lone "/" is not on the list,
// so it is not added here.
arrayList.add("+");
arrayList.add("-");
arrayList.add("!");
arrayList.add("(");
arrayList.add(")");
arrayList.add("{");
arrayList.add("}");
arrayList.add("[");
arrayList.add("]");
arrayList.add("^");
arrayList.add("~");
arrayList.add(":");
arrayList.add("\\");
String string = "a:b+c1|x||c*(?)";
for (int i = 0; i < arrayList.size(); i++) {
    // replace() matches literally and does nothing when the token is absent,
    // so no contains() check is needed.
    string = string.replace(arrayList.get(i), "");
}
System.out.println(string);
2)
String string = "a:b+c1|x||c*(?)";
string = string.replaceAll("[+\\-!(){}\\[\\]^~:\\\\]|/\\*|\\*/|&&|\\|\\|", "");
System.out.println(string);
Thomas wrote on How to remove special characters from a string?:
That depends on what you define as special characters, but try
replaceAll(...):
String result = yourString.replaceAll("[-+.^:,]","");
Note that the ^ character must not be the first one in the list, since
you'd then either have to escape it or it would mean "any but these
characters".
Another note: the - character needs to be the first or last one on the
list, otherwise you'd have to escape it or it would define a range (
e.g. :-, would mean "all characters in the range : to ,).
So, in order to keep consistency and not depend on character
positioning, you might want to escape all those characters that have a
special meaning in regular expressions (the following list is not
complete, so be aware of other characters like (, {, $ etc.):
String result = yourString.replaceAll("[\\-\\+\\.\\^:,]","");
If you want to get rid of all punctuation and symbols, try this regex:
[\p{P}\p{S}] (keep in mind that in Java strings you'd have to escape
backslashes: "[\\p{P}\\p{S}]").
A third way could be something like this, if you can exactly define
what should be left in your string:
String result = yourString.replaceAll("[^\\w\\s]","");
Here's less restrictive alternative to the "define allowed characters"
approach, as suggested by Ray:
String result = yourString.replaceAll("[^\\p{L}\\p{Z}]","");
The regex matches everything that is not a letter in any language and
not a separator (whitespace, linebreak etc.). Note that you can't use
[\P{L}\P{Z}] (upper case P means not having that property), since that
would mean "everything that is not a letter or not whitespace", which
almost matches everything, since letters are not whitespace and vice
versa.
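The Unicode-property classes from the quoted answer can be compared side by side with a small sketch. The sample string and all names below are my own choices, not from the thread.

```java
class CharClassDemo {
    // Removes punctuation (\p{P}) and symbols (\p{S}).
    static String stripPunctAndSymbols(String s) {
        return s.replaceAll("[\\p{P}\\p{S}]", "");
    }

    // Keeps only letters (\p{L}) and separators (\p{Z}); everything else is removed.
    static String keepLettersAndSeparators(String s) {
        return s.replaceAll("[^\\p{L}\\p{Z}]", "");
    }

    // The variant the answer warns about: "not a letter OR not a separator" matches every character.
    static String brokenVariant(String s) {
        return s.replaceAll("[\\P{L}\\P{Z}]", "");
    }

    public static void main(String[] args) {
        String sample = "a+b=c, ok! 42";
        System.out.println("'" + stripPunctAndSymbols(sample) + "'");     // prints 'abc ok 42'
        System.out.println("'" + keepLettersAndSeparators(sample) + "'"); // prints 'abc ok '
        System.out.println("'" + brokenVariant(sample) + "'");            // prints ''
    }
}
```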

How to count the number of sub string from a split in java

I have this piece of code in my java class
mystring = mysuperstring.split("/");
I want to know how many substrings are created by the split.
Normally, if I want to access the first substring I just write
mystring[0];
Also, I want to know whether mystring[5] exists or not.
I want to know how many substrings are created by the split.
Since mystring is an array, you can simply use mystring.length to get the number of substrings.
Also, I want to know whether mystring[5] exists or not.
To do this:
if (mystring.length >= 6) { ... }
mystring = mysuperstring.split("/");
int size = mystring.length;
Remember that arrays are zero-indexed, so when length is 5 the last element has index 4.
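Putting the length check into a small helper avoids an ArrayIndexOutOfBoundsException when an index may not exist. This is only a sketch; partOrDefault, the class name and the sample path are hypothetical. It also shows that trailing empty strings are not counted by the default split.

```java
class SplitCountDemo {
    // Returns parts[index] when that index exists, otherwise the fallback value.
    static String partOrDefault(String[] parts, int index, String fallback) {
        return (index >= 0 && index < parts.length) ? parts[index] : fallback;
    }

    public static void main(String[] args) {
        String[] parts = "usr/local/bin".split("/");
        System.out.println(parts.length);                       // prints 3
        System.out.println(partOrDefault(parts, 0, "<none>"));  // prints usr
        System.out.println(partOrDefault(parts, 5, "<none>"));  // prints <none>
        // Trailing empty strings are dropped, so the two final slashes add nothing.
        System.out.println("a/b//".split("/").length);          // prints 2
    }
}
```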
Here is a simple way to count the substrings:
word.split("/").length;
Try this one & tell me if it works.
import java.util.regex.Pattern;
public class CountSubstring {
public static int countSubstring(String subStr, String str){
// the result of split() will contain one more element than there are delimiter occurrences
// the "-1" second argument makes it not discard trailing empty strings
return str.split(Pattern.quote(subStr), -1).length - 1;
}
public static void main(String[] args){
System.out.println(countSubstring("th", "the three truths"));
System.out.println(countSubstring("abab", "ababababab"));
System.out.println(countSubstring("a*b", "abaabba*bbaba*bbab"));
}
}
I think you should read up on split and on arrays.
If you read about the split function, you will see that it returns an array of String.
Then read about arrays and how to get their size.
I wanted to demo a dummy token validation, so I tried to split by dot (.). It did not work. I puzzled over it for an hour with no luck. After a while, trying things at random, I added an escape character before the dot and it worked.
The reason is that split takes its argument as a regex. I found out why while writing an answer here:
You are splitting on the regex ., which means "any character"
int len = accessToken.split("\\.").length;
The string I wanted to check:
String accessToken = "Bearer 1111.1111.11111"; // dummy value for demonstration
Output:
3
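For comparison, a sketch of what the unescaped dot does to the same string, next to two ways of making it literal. Pattern.quote is a standard library call; the class and method names are invented for the demo.

```java
import java.util.regex.Pattern;

class DotSplitDemo {
    // "." as a regex matches every character, so only empty strings remain and all are dropped.
    static int countUnescaped(String s) {
        return s.split(".").length;
    }

    // "\\." is a literal dot.
    static int countEscaped(String s) {
        return s.split("\\.").length;
    }

    // Pattern.quote turns any string into a literal pattern, so no manual escaping is needed.
    static int countQuoted(String s) {
        return s.split(Pattern.quote(".")).length;
    }

    public static void main(String[] args) {
        String accessToken = "Bearer 1111.1111.11111"; // dummy value
        System.out.println(countUnescaped(accessToken)); // prints 0
        System.out.println(countEscaped(accessToken));   // prints 3
        System.out.println(countQuoted(accessToken));    // prints 3
    }
}
```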
