Splitting string N into N/X strings

I would like some guidance on how to split a string into N separate strings based on an arithmetic operation; for example, string.length()/300.
I am aware of ways to do it with delimiters such as
testString.split(",");
but how does one use greedy/reluctant/possessive quantifiers with the split method?
Update: As requested, an example similar to what I am looking to achieve:
String X = "32028783836295C75546F7272656E745C756E742E657865000032002E002E005C0"
Resulting in X/3 (more or less... done by hand)
X[0] = 32028783836295C75546F
X[1] = 6E745C756E742E6578650
X[2] = 65000032002E002E005C0
Don't worry about explaining how to put it into the array; I have no problem with that, only with how to split using an arithmetic operation instead of a delimiter.

You could do that by splitting on (?<=\G.{5}) whereby the string aaaaabbbbbccccceeeeefff would be split into the following parts:
aaaaa
bbbbb
ccccc
eeeee
fff
The \G matches the (zero-width) position where the previous match occurred. Initially, \G starts at the beginning of the string. Note that by default the . meta char does not match line breaks, so if you want it to match every character, enable DOT-ALL: (?s)(?<=\G.{5}).
A demo:
class Main {
public static void main(String[] args) {
int N = 5;
String text = "aaaaabbbbbccccceeeeefff";
String[] tokens = text.split("(?<=\\G.{" + N + "})");
for(String t : tokens) {
System.out.println(t);
}
}
}
which can be tested online here: http://ideone.com/q6dVB
EDIT
Since you asked for documentation on regex, here are the specific tutorials for the topics the suggested regex contains:
\G, see: http://www.regular-expressions.info/continue.html
(?<=...), see: http://www.regular-expressions.info/lookaround.html
{...}, see: http://www.regular-expressions.info/repeat.html
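To tie the regex back to the arithmetic in the question, here is a minimal sketch that derives the chunk length from a desired number of parts. The class and method names (RegexParts, splitIntoParts) are hypothetical, and rounding the chunk length up is an assumption about how the remainder should be handled:

```java
class RegexParts {
    // Splits text into at most `parts` pieces of equal length (the last may be shorter).
    static String[] splitIntoParts(String text, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        // Round up so that the remainder does not create an extra piece.
        int size = Math.max(1, (text.length() + parts - 1) / parts);
        // (?s) lets . match line breaks; \G anchors each chunk to the end of the previous match.
        return text.split("(?s)(?<=\\G.{" + size + "})");
    }
}
```

For example, RegexParts.splitIntoParts("abcdefgh", 3) uses a chunk length of 3, so the last piece holds the two leftover characters.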

If there's a fixed length that you want each String to be, you can use Guava's Splitter:
int length = string.length() / 300;
Iterable<String> splitStrings = Splitter.fixedLength(length).split(string);
Each String in splitStrings with the possible exception of the last will have a length of length. The last may have a length between 1 and length.
Note that unlike String.split, which first builds an ArrayList<String> and then uses toArray() on that to produce the final String[] result, Guava's Splitter is lazy and doesn't do anything with the input string when split is called. The actual splitting and returning of strings is done as you iterate through the resulting Iterable. This allows you to just iterate over the results without allocating a data structure and storing them all or to copy them into any kind of Collection you want without going through the intermediate ArrayList and String[]. Depending on what you want to do with the results, this can be considerably more efficient. It's also much more clear what you're doing than with a regex.

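How about plain old String.substring? In Java versions before 7u6 it was memory friendly, as the substring reused the original char array; newer versions copy the characters into a new array instead.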
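A minimal sketch of the substring approach, assuming fixed-size chunks are wanted; the names Chunker and chunks are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class Chunker {
    // Cuts s into consecutive pieces of `size` characters; the last piece may be shorter.
    static List<String> chunks(String s, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < s.length(); i += size) {
            // Math.min keeps the final substring inside the string bounds.
            out.add(s.substring(i, Math.min(i + size, s.length())));
        }
        return out;
    }
}
```

Calling Chunker.chunks(testString, testString.length() / 300) would correspond to the arithmetic in the question, provided the string has at least 300 characters so that the size is not zero.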

well, I think this is probably as efficient a way to do this as any other.
int N = 300;
int sublen = testString.length() / N;
String[] subs = new String[N];
for (int i = 0; i < N; i++) {
    // the last piece also takes the remainder left over by the integer division
    int end = (i == N - 1) ? testString.length() : (i + 1) * sublen;
    subs[i] = testString.substring(i * sublen, end);
}
You can do it faster if you need the items as char[] arrays rather than as individual Strings, depending on how you need to use the results, e.g. using testString.toCharArray()

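Dunno, you'll probably need a method that takes a string and an int number of parts and returns the pieces. A sketch (needs java.util.ArrayList and java.util.List):
public static String[] splitInto(String splitString, int parts)
{
    // round up so that at most `parts` pieces are produced
    int dlength = Math.max(1, (splitString.length() + parts - 1) / parts);
    List<String> retVal = new ArrayList<String>();
    for (int i = 0; i < splitString.length(); i += dlength)
    {
        // clamp the end index so the last piece does not run past the string
        retVal.add(splitString.substring(i, Math.min(i + dlength, splitString.length())));
    }
    return retVal.toArray(new String[0]);
}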

Related

Android String.split("") returning extra element

I am trying to split a word into its individual letters.
I tried both String.split("") and String.split("|"); however, when I split a word it creates an extra empty element.
Example:
word = "word";
int n = word.length();
Log.i("20",Integer.toString(n));
String[] letters = word.split("|");
Log.i("25",Integer.toString(letters.length));
The output in the Android Monitor is:
07-21 15:50:23.084 5711-5711/com.strizhevskiy.movetester I/20: 4
07-21 15:50:23.085 5711-5711/com.strizhevskiy.movetester I/25: 5
I put the individual letters into TextView blocks and I can actually see an extra empty TextView.
When I test these methods in my regular Java it outputs the expected answer: 4.
I am almost tempted to think this is an actual bug in Android's implementation of the method.

I am thinking you want to do this:
public Character[] toCharacterArray( String s ) {
if ( s == null ) {
return null;
}
int len = s.length();
Character[] array = new Character[len];
for (int i = 0; i < len ; i++) {
array[i] = Character.valueOf(s.charAt(i)); // the Character constructor is deprecated since Java 9
}
return array;
}
Instead of splitting a word without delimiters?
I hope this helps!

It's hard to say whether it's a bug or expected behavior, because what you are doing doesn't make sense. You are splitting the string on a logical OR (split expects a regular expression, not a plain string), so the result could differ between Android and regular Java, and I don't see any issue there.
Anyway, there are many normal ways to achieve what you want, e.g. iterating over the word character by character in a loop, or using String's toCharArray method.

Thank you for the suggestions. My current work-around is to use a mock array and copying over into a fresh array using System.arraycopy().
String[] mockLetters = word.split("");
int n = word.length();
String[] letters = new String[n];
System.arraycopy(mockLetters,1,letters,0,n);
I appreciate the suggestions to use toCharArray(). However, these letters then get put into TextViews, and TextView doesn't seem to accept char. I could, of course, make it work, but I've decided to stick with what I currently have.
Tom, in a comment to my question, answered my underlying issue:
Why String.split() worked differently in Android than it does in Java?
Apparently the rules for String.split() changed with Java 8.
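Since the result of split("") depends on the Java version (before Java 8 a zero-width match at the start produced a leading empty string), a version-independent alternative is to avoid the regex altogether. The sketch below is only an illustration; the names Letters and letters are made up, and it returns String elements so they can be passed to anything expecting text:

```java
class Letters {
    // Returns each character of word as its own one-character String.
    static String[] letters(String word) {
        String[] out = new String[word.length()];
        for (int i = 0; i < word.length(); i++) {
            out[i] = String.valueOf(word.charAt(i));
        }
        return out;
    }
}
```

This always yields exactly word.length() elements, so no extra element has to be copied away afterwards.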

Try passing 0 as the limit, per the documentation below, so that trailing empty strings are discarded.
String[] split(String regex, int limit)
If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

How can i extract specific terms from string lines in Java?

I have a serious problem with extracting terms from each string line. To be more specific, I have one csv formatted file which is actually not csv format (it saves all terms into line[0] only)
So, here's just example string line among thousands of string lines:
(split() doesn't work.!!! )
test.csv
"31451 CID005319044 　　15939353　　 C8H14O3S2 　　　beta-lipoic acid　　 C1C[S#](=O)S[C##H]1CCCCC(=O)O "
"12232 COD05374044 23439353　　C924O3S2 　　　saponin　　 CCCC(=O)O "
"9048 　 CTD042032　23241　　C3HO4O3S2　Berberine　 [C##H]1CCCCC(=O)O "
I want to extract "beta-lipoic acid" ,"saponin" and "Berberine" only which is located in 5th position.
You can see there are big spaces between terms, so that's why I said 5th position.
In this case, how can I extract terms located in 5th position for each line?
One more thing: the length of whitespace between each of the six terms is not always equal. the length could be one, two, three, four, or five, or something like that.
Because the length of whitespace is random, I can not use the .split() function.
For example, in the first line I would get "beta-lipoic" instead of "beta-lipoic acid".

Here is a solution for your problem using the string split and index of,
import java.util.ArrayList;
public class StringSplit {
public static void main(String[] args) {
String[] seperatedStr = null;
int fourthStrIndex = 0;
String modifiedStr = null, finalStr = null;
ArrayList<String> strList = new ArrayList<String>();
strList.add("31451 CID005319044 　　15939353　　 C8H14O3S2 beta-lipoic acid C1C[S#](=O)S[C##H]1CCCCC(=O)O ");
strList.add("12232 COD05374044 23439353 C924O3S2 saponin CCCC(=O)O ");
strList.add("9048 CTD042032 23241 C3HO4O3S2 Berberine [C##H]1CCCCC(=O)O ");
for (String item: strList) {
seperatedStr = item.split("\\s+");
fourthStrIndex = item.indexOf(seperatedStr[3]) + seperatedStr[3].length();
modifiedStr = item.substring(fourthStrIndex, item.length());
finalStr = modifiedStr.substring(0, modifiedStr.indexOf(seperatedStr[seperatedStr.length - 1]));
System.out.println(finalStr.trim());
}
}
}
Output:
beta-lipoic acid
saponin
Berberine

Option 1 : Use String.split and split on two or more consecutive whitespace characters, as in the code below:
String s[] = str.split("\\s\\s+");
for (String string : s) {
System.out.println(string);
}
Option 2 : Implement your own string split logic by walking through all the characters. Sample code below (this code is just to give an idea; I did not test it).
public static List<String> getData(String str) {
List<String> list = new ArrayList<>();
String s="";
int count=0;
for(char c : str.toCharArray()){
System.out.println(c);
if (c==' '){
count++;
}else {
s = s+c;
}
if(count>1&&!s.equalsIgnoreCase("")){
list.add(s);
count=0;
s="";
}
}
return list;
}

This would be a relatively easy fix if it weren't for beta-lipoic acid...
Assuming that only spaces/tabs/other whitespace separate terms, you could split on whitespace.
Pattern whitespace = Pattern.compile("\\s+");
String[] terms = whitespace.split(line); // Not 100% sure of syntax here...
// Your desired term should be index 4 of the terms array
While this would work for the majority of your terms, this would also result in you losing the "acid" in "beta-lipoic acid"...
Another hacky solution would be to add in a check for the 6th spot in the array produced by the above code and see if it matches English letters. If so, you can be reasonably confident that the 6th spot is actually part of the same term as the 5th spot, so you can then concatenate those together. This falls apart pretty quickly though if you have terms with >= 3 words. So something like
Pattern possibleEnglishWord = Pattern.compile("[a-zA-Z]*"); // Can add dashes and such as needed
if (possibleEnglishWord.matcher(terms[5]).matches()) {
// return terms[4] + " " + terms[5] or something like that
}
Another thing you can try is to replace all groups of spaces with a single space, and then remove every token that isn't made up of just English letters/dashes
line = whitespace.matcher(line).replaceAll(" ");
Pattern notEnglishWord = Pattern.compile("\\S*[^a-zA-Z\\s-]\\S*"); // any token containing a character that is not a letter or dash
line = notEnglishWord.matcher(line).replaceAll("").trim();
Then hopefully the only thing that is left would be the term you're looking for.
Hopefully this helps, but I do admit it's rather convoluted. One of the issues is that it appears that non-term words may have only one space between them, which would fool Option 1 as presented by Hirak... If that weren't the case that option should work.
Oh by the way, if you do end up doing this, put the Pattern declarations outside of any loops. They only need to be created once.
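As a sketch of the "take everything between the fourth field and the last field" idea, under the assumption that every line has six fields and only the fifth may contain inner spaces. The names FifthTerm and fifthTerm are hypothetical. Note that \s does not match the ideographic space U+3000 by default, which appears in the sample lines, so the pattern lists it explicitly:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class FifthTerm {
    // Runs of ordinary whitespace and/or ideographic spaces (U+3000).
    private static final Pattern GAP = Pattern.compile("[\\s\\u3000]+");

    // Assumption: six fields per line, and only the fifth field may contain spaces.
    static String fifthTerm(String line) {
        // Normalise every gap to a single space, then split on that space.
        String[] tokens = GAP.matcher(line).replaceAll(" ").trim().split(" ");
        if (tokens.length < 6) {
            throw new IllegalArgumentException("expected at least six fields: " + line);
        }
        // Tokens 0..3 are the first four fields and the last token is the sixth field;
        // whatever lies in between belongs to the fifth field.
        return String.join(" ", Arrays.copyOfRange(tokens, 4, tokens.length - 1));
    }
}
```

The Pattern is a static constant, so it is compiled only once. If another field can also contain spaces, this assumption breaks and a real delimiter is needed.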

Exporting specific pattern of string using split method in a most efficient way

I want to extract a pattern from a bit stream held in a String variable. Assume the bit stream is something like bitStream="111000001010000100001111". I am looking for Java code that saves this bit stream in an array (say bitArray) so that each run of consecutive "0"s or "1"s is stored in one array element. In this example the output would be something like this:
bitArray[0]="111"
bitArray[1]="00000"
bitArray[2]="1"
bitArray[3]="0"
bitArray[4]="1"
bitArray[5]="0000"
bitArray[6]="1"
bitArray[7]="0000"
bitArray[8]="1111"
I want to use bitArray to calculate the number of bits in each run. For example, in this case the final output would be "3,5,1,1,1,4,1,4,4". I figured that the split method would probably solve this, but I don't know which splitting pattern to use: bitStream.split("1+") splits on runs of "1"s and bitStream.split("0+") splits on runs of "0"s, but how can it split based on both?
Mathew suggested this solution and it works:
var wholeString = "111000001010000100001111";
wholeString = wholeString.replace('10', '1,0');
wholeString = wholeString.replace('01', '0,1');
stringSplit = wholeString.split(',');
My question is "Is this solution the most efficient one?"

Try replacing any occurrence of "01" and "10" with "0,1" and "1,0" respectively. Then once you've injected the commas, split the string using the comma as the delimiting character.
String wholeString = "111000001010000100001111";
wholeString = wholeString.replace("10", "1,0");
wholeString = wholeString.replace("01", "0,1");
String stringSplit[] = wholeString.split(",");

You can do this with a simple regular expression. It matches 1s and 0s and will return each in the order they occur in the stream. How you store or manipulate the results is up to you. Here is some example code.
String testString = "111000001010000100001111";
Pattern pattern = Pattern.compile("1+|0+");
Matcher matcher = pattern.matcher(testString);
while (matcher.find())
{
System.out.print(matcher.group().length());
System.out.print(" ");
}
This will result in the following output:
3 5 1 1 1 4 1 4 4
One option for storing the results is to put them in an ArrayList<Integer>
Since the OP wanted the most efficient solution, I ran some tests to see how long each answer takes to process a large stream 10000 times and came up with the following results. The times differed from test to test, but the order from fastest to slowest remained the same. I know tick-based performance testing has its issues, like not accounting for system load, but I just wanted a quick test.
My answer completed in 1145 ms
Alessio's answer completed in 1202 ms
Matthew Lee Keith's answer completed in 2002 ms
Evgeniy Dorofeev's answer completed in 2556 ms
Hope this helps

I won't give you code, but I'll guide you to a possible solution:
Construct an ArrayList<Integer> and iterate over the bits. As long as you see 1's, increment a counter; as soon as you see a 0, add the counter to the ArrayList and start counting again (and likewise when a run of 0's ends). After this procedure you'll have an ArrayList of numbers, e.g. [1,2,2,3,4], representing the lengths of the runs of 1's and 0's.
Then you construct an array of the size of the ArrayList and fill it accordingly.
The time complexity is O(n) because you iterate over the input only once.

This code works for any String and patterns, not only 1s and 0s. Iterate char by char, and if the current char is equal to the previous one, append the last char to the last element of the List, otherwise create a new element in the list.
public List<String> getArray(String input){
List<String> output = new ArrayList<String>();
if(input==null || input.length()==0) return output;
int count = 0;
char [] inputA = input.toCharArray();
output.add(inputA[0]+"");
for(int i = 1; i <inputA.length;i++){
if(inputA[i]==inputA[i-1]){
String current = output.get(count)+inputA[i];
output.remove(count);
output.add(current);
}
else{
output.add(inputA[i]+"");
count++;
}
}
return output;
}

try this
String[] a = s.replaceAll("(.)(?!\\1)", "$1,").split(",");

I tried to implement @Maroun Maroun's solution.
public static void main(String args[]){
long start = System.currentTimeMillis();
String bitStream ="0111000001010000100001111";
int length = bitStream.length();
char base = bitStream.charAt(0);
ArrayList<Integer> counts = new ArrayList<Integer>();
int count = -1;
char currChar = ' ';
for (int i=0;i<length;i++){
currChar = bitStream.charAt(i);
if (currChar == base){
count++;
}else {
base = currChar;
counts.add(count+1);
count = 0;
}
}
counts.add(count+1);
System.out.println("Time taken :" + (System.currentTimeMillis()-start ) +"ms");
System.out.println(counts.toString());
}
I believe this is a more efficient way; as he said, it is O(n), since you iterate only once. Since the goal is only to get the counts, not to store the runs in an array, I would recommend this. Even a regular expression has to iterate over the input internally anyway.
The resulting output is
Time taken :0ms
[1, 3, 5, 1, 1, 1, 4, 1, 4, 4]

Try this one:
String[] parts = input.split("(?<=1)(?=0)|(?<=0)(?=1)");
See in action here: http://rubular.com/r/qyyfHNAo0T
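As a sketch of how that lookaround split could produce the "3,5,1,1,1,4,1,4,4" output the question asks for (the names RunLengths and runLengths are made up, and the input is assumed to be non-empty and to contain only 0s and 1s):

```java
class RunLengths {
    // Splits at every boundary between a 1 and a 0 (in either order) and
    // returns the length of each run, joined by commas.
    static String runLengths(String bitStream) {
        String[] parts = bitStream.split("(?<=1)(?=0)|(?<=0)(?=1)");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(parts[i].length());
        }
        return sb.toString();
    }
}
```

The lookarounds are zero-width, so no characters are consumed by the split and every bit ends up in exactly one part.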

Determining if a given string of words has words greater than 5 letters long

So, I'm in need of help on my homework assignment. Here's the question:
Write a static method, getBigWords, that gets a String parameter and returns an array whose elements are the words in the parameter that contain more than 5 letters. (A word is defined as a contiguous sequence of letters.) So, given a String like "There are 87,000,000 people in Canada", getBigWords would return an array of two elements, "people" and "Canada".
What I have so far:
public static getBigWords(String sentence)
{
String[] a = new String;
String[] split = sentence.split("\\s");
for(int i = 0; i < split.length; i++)
{
if(split[i].length => 5)
{
a.add(split[i]);
}
}
return a;
}
I don't want an answer, just a means to guide me in the right direction. I'm a novice at programming, so it's difficult for me to figure out what exactly I'm doing wrong.
EDIT:
I've now modified my method to:
public static String[] getBigWords(String sentence)
{
ArrayList<String> result = new ArrayList<String>();
String[] split = sentence.split("\\s+");
for(int i = 0; i < split.length; i++)
{
if(split[i].length() > 5)
{
if(split[i].matches("[a-zA-Z]+"))
{
result.add(split[i]);
}
}
}
return result.toArray(new String[0]);
}
It prints out the results I want, but the online software I use to turn in the assignment, still says I'm doing something wrong. More specifically, it states:
Edith de Stance states:
⇒     You might want to use: +=
⇒     You might want to use: ==
⇒     You might want to use: +
not really sure what that means....

The main problem is that you can't have an array that makes itself bigger as you add elements.
You have 2 options:
ArrayList (basically a variable-length array).
Make an array guaranteed to be bigger.
Also, some notes:
The definition of an array needs to look like:
int size = ...; // V- note the square brackets here
String[] a = new String[size];
Arrays don't have an add method, you need to keep track of the index yourself.
You're currently only splitting on spaces, so 87,000,000 will also match. You could validate the string manually to ensure it consists of only letters.
It's >=, not =>.
I believe the function needs to return an array:
public static String[] getBigWords(String sentence)
It actually needs to return something:
return result.toArray(new String[0]);
rather than
return null;
The "You might want to use" suggestions hint that you might have to process the input character by character.
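A character-by-character version could look like the sketch below. It is only one possible reading of the assignment, assuming that "letters" means whatever Character.isLetter accepts; the class name BigWords is made up, while getBigWords is the name from the assignment:

```java
import java.util.ArrayList;
import java.util.List;

class BigWords {
    // Collects every contiguous run of letters that is longer than 5 characters.
    static String[] getBigWords(String sentence) {
        List<String> result = new ArrayList<String>();
        String word = "";
        // The loop runs one step past the end so that the final word is flushed too.
        for (int i = 0; i <= sentence.length(); i++) {
            if (i < sentence.length() && Character.isLetter(sentence.charAt(i))) {
                word += sentence.charAt(i);
            } else {
                if (word.length() > 5) {
                    result.add(word);
                }
                word = "";
            }
        }
        return result.toArray(new String[0]);
    }
}
```

Because any non-letter ends the current word, digits and commas such as those in 87,000,000 never become part of a word.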

First, try printing out all the elements in your split array. Remember, you only want to look at words. So, check whether this is the case by printing out each element of the split array inside your for loop. (I suspect you will get a false positive at the moment.)
Also, you need to revisit your books on arrays in Java. You can not dynamically add elements to an array. So, you will need a different data structure to be able to use an add() method. An ArrayList of Strings would help you here.

Split your string on whitespace; it will return an array. You can check the length of each word by iterating over that array.
You can split the string this way: myString.split("\\s+");

Try this...
public static String[] getBigWords(String sentence)
{
java.util.ArrayList<String> result = new java.util.ArrayList<String>();
String[] split = sentence.split("\\s+");
for(int i = 0; i < split.length; i++)
{
if(split[i].length() > 5)
{
if(split[i].matches("[a-zA-Z]+"))
{
result.add(split[i]);
}
if (split[i].matches("[a-zA-Z]+,"))
{
String temp = "";
for(int j = 0; j < split[i].length(); j++)
{
if((split[i].charAt(j))!=((char)','))
{
temp += split[i].charAt(j);
//System.out.print(split[i].charAt(j) + "|");
}
}
result.add(temp);
}
}
}
return result.toArray(new String[0]);
}

What you have done is correct in principle, but you can't use an add method on an array. You should assign by index, like a[position] = split[i]; and if you want to ignore numbers, check each character with Character.isDigit().

Your logic is valid, but you have some syntax issues. If you are not using an IDE like Eclipse that shows you syntax errors, try commenting out lines to pinpoint which ones are syntactically incorrect. I want to also tell you that once an array is created its length cannot change. Hopefully that sets you off in the right directions.

Apart from the syntax errors, the String array declaration should look like new String[n],
and since arrays have no add method you should assign by index, like
a[i] = split[i];
You need to add another condition alongside the length condition to check that the given word consists only of letters. This can be done in 2 ways:
the first way is to use the Character.isLetter() method, and the second way is to create a regular expression
that checks whether the string contains only letters. Google for the regular expression and use a matcher to match, like below
Pattern pattern=Pattern.compile();
Matcher matcher=pattern.matcher();
Final point is use another counter (let say j=0) to store output values and increment this counter as and when you store string in the array.
a[j++] = split[i];

I would use a string tokenizer (string tokenizer class in java)
Iterate through each entry and if the string length is more than 4 (or whatever you need) add to the array you are returning.
You said no code, so... (This is like 5 lines of code)

which code is more efficient?

which of the following is an efficient way to reverse words in a string ?
public String Reverse(StringTokenizer st){
    String[] words = new String[st.countTokens()];
    int i = 0;
    while(st.hasMoreTokens()){
        words[i] = st.nextToken(); i++;
    }
    String output = "";
    for(int j = words.length-1; j >= 0; j--)
        output += words[j]+" ";
    return output;
}
OR
public String Reverse(StringTokenizer st, String output){
if(!st.hasMoreTokens()) return output;
output = st.nextToken()+" "+output;
return Reverse(st, output);}
public String ReverseMain(StringTokenizer st){
return Reverse(st, "");}
while the first way seems more readable and straight forward, there are two loops in it. In the 2nd method, I've tried doing it in tail-recursive way. But I am not sure whether java does optimize tail-recursive code.

you could do this in just one loop
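public String Reverse(StringTokenizer st){
    int length = st.countTokens();
    String[] words = new String[length];
    int i = length - 1;
    // fill the array from the back so the words end up in reverse order
    while(i >= 0){
        words[i] = st.nextToken(); i--;
    }
    return String.join(" ", words);
}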

But I am not sure whether java does optimize tail-recursive code.
It doesn't. Or at least the Sun/Oracle Java implementations don't, up to and including Java 7.
References:
"Tail calls in the VM" by John Rose @ Oracle.
Bug 4726340 - RFE: Tail Call Optimization
I don't know whether this makes one solution faster than the other. (Test it yourself ... taking care to avoid the standard micro-benchmarking traps.)
However, the fact that Java doesn't implement tail-call optimization means that the 2nd solution is liable to run out of stack space if you give it a string with a large (enough) number of words.
Finally, if you are looking for a more space-efficient way to implement this, there is a clever way that uses just a StringBuilder.
Create a StringBuilder from your input String
Reverse the characters in the StringBuilder using reverse().
Step through the StringBuilder, identifying the start and end offset of each word. For each start/end offset pair, reverse the characters between the offsets. (You have to do this using a loop.)
Turn the StringBuilder back into a String.
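The steps above can be sketched as follows. This is an illustration, not a tuned implementation: it assumes words are separated by single spaces, and the names ReverseWords and reverseWords are made up:

```java
class ReverseWords {
    // Reverses the order of the words while keeping each word readable.
    static String reverseWords(String input) {
        // Steps 1 and 2: copy into a StringBuilder and reverse all characters.
        StringBuilder sb = new StringBuilder(input).reverse();
        int start = 0;
        // Step 3: find each word and reverse its characters back in place.
        for (int i = 0; i <= sb.length(); i++) {
            if (i == sb.length() || sb.charAt(i) == ' ') {
                for (int lo = start, hi = i - 1; lo < hi; lo++, hi--) {
                    char c = sb.charAt(lo);
                    sb.setCharAt(lo, sb.charAt(hi));
                    sb.setCharAt(hi, c);
                }
                start = i + 1;
            }
        }
        // Step 4: turn the StringBuilder back into a String.
        return sb.toString();
    }
}
```

Apart from the StringBuilder itself, no array of words or list of tokens is allocated, which is where the space saving comes from.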

You can test results by timing both of them on a large amount of results
eg. You reverse 100000000 strings and see how many seconds it takes. You could also compare start and end system timestamps to get the exact difference between the two functions.

StringTokenizer is not deprecated but if you read the current JavaDoc...
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
String[] strArray = str.split(" ");
StringBuilder sb = new StringBuilder();
for (int i = strArray.length - 1; i >= 0; i--)
    sb.append(strArray[i]).append(" ");
String reversedWords = sb.substring(0, sb.length() - 1); // strip trailing space
