trying to add substring to array every third character - java

In short, I'm trying to take a string then divide it every three characters, adding those three characters to an array as it progresses.
The initial input to the function (rawData) looks something like this:
ATGCCACTATGGTAG
but can vary in length.
I'm trying to convert the above data (representing nucleotides) into the individual codons like so:
[ATG,CCA,CTA,TGG,TAG]
Note that every chunk of 3 is now its own element in the array.
This is my code:
public static void codonList(String rawData) {
int previous = 0;
String[] codons = new String[rawData.length() / 3];
for (int i = 0; i < rawData.length(); i++) {
previous++;
// goes through each
// split at every third then append to end of codon string
if (previous % 3 == 0) {
String chunk = rawData.substring(previous - 3, previous);
codons[i] = chunk;
System.out.println(Arrays.toString(codons));
}
}
}
and its output:
[null,null,AGT,null,null]
I'm 90% sure it's a simple fix but can't seem to get it figured out. If someone can provide some insight that would be greatly appreciated!

You have several issues:
You are iterating through the whole string, but you only want to create rawData.length() / 3 codons.
As Kevin hinted, get 3-character chunks: String chunk = rawData.substring(i*3, i*3+3);
Only print the array when you are done processing the string.
public static void codonList(String rawData) {
String[] codons = new String[rawData.length() / 3];
for (int i = 0; i < rawData.length() / 3; i++) {
// goes through each
// split at every third then append to end of codon string
String chunk = rawData.substring(i*3, i*3+3);
codons[i] = chunk;
}
System.out.println(Arrays.toString(codons));
}
Output:
[ATG, CCA, CTA, TGG, TAG]

Related

Having problems with splitting a String into max 1Mb size subStrings

I have to split a String into chunks of at most 1 MB each. With UTF-8 as the character encoding, some letters take up more than 1 byte, so I must avoid splitting a character in the middle (for example, 'á' is 2 bytes, so 1 byte can't go at the end of one String and the other at the beginning of the next). This is my method:
public static List<String> cutString3(String original, int chunkSize, String encoding) throws UnsupportedEncodingException {
List<String> strings = new ArrayList<>();
final int end = original.length();
int from = 0;
int to = 0;
do {
to = (to + chunkSize > end) ? end : to + chunkSize;
String chunk = original.substring(from, to); // get chunk
while (chunk.getBytes(encoding).length > chunkSize) { // cut the chunk from the end
chunk = original.substring(from, --to);
}
strings.add(chunk); // add chunk to collection
from = to; // next chunk
} while (to < end);
return strings;
}
I'm using the following method to generate an example String:
private static String createDataSize(int msgSize) {
StringBuilder sb = new StringBuilder(msgSize);
for (int i = 0; i < msgSize; i++) {
sb.append("a");
}
return sb.toString();
}
I call the method as follows:
String exampleString = createDataSize(1024*1024*3);
cutString3(exampleString, 1024*1024, "UTF-8");
This works without problems: I get back 3 Strings, as the 3-megabyte String was split into three 1-megabyte Strings. But if I change the createDataSize() method to append 'á' instead, so the example String consists only of "áááááá...", the inner while loop in the cutString3 method takes forever, since it removes the 'á' characters one by one until the chunk fits into the given size. How can I improve the inner while loop, or come up with a similar solution? The Strings can be smaller than 1 megabyte, just not bigger!
Binary search logic would clearly fit your need.
Simply decrement faster: cut the chunk back by half of the chunk size; if you still have some room, add back half of that amount; if not, remove half of it. And so on.
A simpler solution would be to remove only the difference between chunk.getBytes(encoding).length and chunkSize, then see how many more bytes you can still take if you want to fill the chunk completely.
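As a rough sketch of another way to avoid the slow inner loop, the chunk boundaries can be found in a single pass by counting how many UTF-8 bytes each code point needs, so getBytes is never called repeatedly. Note that this is a linear scan, not the binary search suggested above, and the class and method names (Utf8Chunker, utf8Length, split) are made up for illustration. It assumes the encoding is always UTF-8.

```java
import java.util.ArrayList;
import java.util.List;

class Utf8Chunker {
    // Number of bytes UTF-8 needs to encode one Unicode code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Splits 'original' into pieces whose UTF-8 encoding is at most maxBytes long,
    // never cutting a character (or a surrogate pair) in the middle.
    static List<String> split(String original, int maxBytes) {
        List<String> chunks = new ArrayList<>();
        int from = 0;   // start index of the current chunk
        int bytes = 0;  // UTF-8 size of the current chunk so far
        int i = 0;
        while (i < original.length()) {
            int codePoint = original.codePointAt(i);
            int len = utf8Length(codePoint);
            if (len > maxBytes) {
                throw new IllegalArgumentException("maxBytes is smaller than one character");
            }
            if (bytes + len > maxBytes) {
                // The next character would not fit: close the current chunk.
                chunks.add(original.substring(from, i));
                from = i;
                bytes = 0;
            }
            bytes += len;
            i += Character.charCount(codePoint);
        }
        if (from < original.length()) {
            chunks.add(original.substring(from));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Four 2-byte characters with a 4-byte limit give two chunks.
        System.out.println(split("\u00e1\u00e1\u00e1\u00e1", 4).size());
    }
}
```

Each character is inspected once, so the cost grows linearly with the input length regardless of how many multi-byte characters it contains.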

How to Take every three letters from a list and saving it as an array?

How could I save every three letters of this string, ATGCCACTATGGTAG, in an array, with a comma separating every three letters?
This is what I have:
The parameter sequence is the above jumble of letters.
public static void listCodons(String sequence){
int length = sequence.length();
String[] listOfCodons = new String[length];
for(int i = 0; i< length; i++){
//This is where I'm not sure what to do
listOfCodons = sequence[i]+sequence[i+1]+sequence[i+2];
}
}
System.out.print(Arrays.toString(listOfCodons));
}
First of all, this code is syntactically incorrect, and there are several errors. The first is that there are two } after the for loop; the second one denotes the end of the method, so the System.out.print after it falls outside the listCodons method, which is an error.
The second mistake is that in Java, characters in a string can't be accessed with [index]; you have to use .charAt(index) instead.
If I understood your problem correctly, you want your parameter ATGCCACTATGGTAG to become an array of ["ATG", "CCA", "CTA", "TGG", "TAG"]. If that's the case, here is the solution, with fixed problems:
public static void listCodons(String sequence) {
int length = sequence.length();
String[] listOfCodons = new String[length / 3];
for (int i = 0; i < listOfCodons.length; i++) {
listOfCodons[i] = sequence.substring(i * 3, i * 3 + 3);
}
System.out.print(Arrays.toString(listOfCodons));
}
Use 1/3 of the length of the string, since we take 3 characters at a time. i * 3 is the codon index times 3, which gives the starting position of that codon in the string, and i * 3 + 3 is the end position, so substring takes the 3 characters starting there. The only problem is if the chunk length isn't consistent (for example, if the sequence length isn't a multiple of 3), but there is no information about that.
A smarter and easier-to-read way is to use the Guava libraries.
Here's the code:
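If the sequence length might not be a multiple of 3, one possible variation is to round the number of chunks up and clamp the end index, so the final chunk simply holds whatever is left. This is only a sketch: the names CodonChunks and chunks are invented here, and keeping the short trailing chunk (rather than dropping it) is an assumption about what you would want.

```java
import java.util.Arrays;

class CodonChunks {
    // Splits 'sequence' into pieces of 'size' characters; the last piece may be shorter.
    static String[] chunks(String sequence, int size) {
        int count = (sequence.length() + size - 1) / size; // ceiling division
        String[] result = new String[count];
        for (int i = 0; i < count; i++) {
            int start = i * size;
            // Never read past the end of the string.
            int end = Math.min(start + size, sequence.length());
            result[i] = sequence.substring(start, end);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(chunks("ATGCCACTATGGTAG", 3)));
        // prints [ATG, CCA, CTA, TGG, TAG]
        System.out.println(Arrays.toString(chunks("ATGCCAC", 3)));
        // prints [ATG, CCA, C]
    }
}
```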
public static void listCodons(String sequence)
{
Iterable<String> pieces = Splitter.fixedLength(3).split(sequence);
// note this is Java 8 Lambda Syntax
pieces.forEach(pi -> System.out.println(pi));
// If you can not use Java 8 take the following:
for (String piece : pieces)
{
System.out.println(piece);
}
}
Let's assume that the length of your sequence is 9, and you want to group every three characters.
This gives you the following grouping:
characters 0, 1, and 2;
characters 3, 4, and 5;
characters 6, 7, and 8.
Notice how the first character in each group is indexed by a multiple of 3.
You can use this idea in your code.
public static void listCodons(String sequence){
/** If your sequence has n characters, then
/* dividing it into groups of three results in
/* n/3 groups. */
int length = sequence.length();
String[] listOfCodons = new String[length/3];
for(int i = 0; i< length/3; i++) {
/** Remember: the first character of each codon
/* is indexed in the sequence by a multiple
/* of three. */
listOfCodons[i] = sequence.substring(3*i, 3*(i+1));
}
System.out.print(Arrays.toString(listOfCodons));
}
So what is going on?
Let's use your original example ATGCCACTATGGTAG.
The sequence has a length of 15.
Thus, we will get in the end 15/3=5 codons.
Each codon i will consist of the characters indexed by 3*i, 3*i+1, and 3*i+2.
So you take the corresponding substring and add it to the list of codons.
Hope this helps.
Based on the link posted in @shikjohari's answer on SO, you could solve it like this:
public class Split {
public static void listCodons(String sequence) {
String[] listOfCodons = sequence.split("(?<=\\G...)");
System.out.print(Arrays.toString(listOfCodons));
}
public static void main(String[] args) throws Exception {
listCodons("ATGCCACTATGGTAG");
}
}
Explanation copied from the mentioned link: The regex (?<=\G...) matches an empty string that has the last match (\G) followed by three characters (...) before it ((?<= )).
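To see the regex at work with a different chunk size, here is a small sketch that builds the same lookbehind for any length n. The class and method names (RegexChunks, split) are hypothetical, and it assumes the input contains no line terminators, since . does not match them by default.

```java
import java.util.Arrays;

class RegexChunks {
    // Splits after every run of n characters counted from the end of the previous match.
    static String[] split(String sequence, int n) {
        return sequence.split("(?<=\\G.{" + n + "})");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("ATGCCACTATGGTAG", 3)));
        // prints [ATG, CCA, CTA, TGG, TAG]
        System.out.println(Arrays.toString(split("ATGCCAC", 3)));
        // prints [ATG, CCA, C]
    }
}
```

Unlike the substring loops above, this keeps a shorter final piece when the length is not a multiple of n.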
Try this code :
public static void listCodons(String sequence){
int length = (sequence.length()/3);
String[] listOfCodons = new String[length];
int j=0,i=0;
while(i<sequence.length() && j<listOfCodons.length){
listOfCodons[j++]=sequence.substring(i, i+3);
i+=3;
}
System.out.println(Arrays.toString(listOfCodons));
}

Need to store every other character of a string, into another

public class newString {
public static void main (String args[]){
String title = "Book";
String title1;
title1 = title;
for(int i = 0; i < title.length(); i++){
for (int x = 0; x<title1.length(); x++){
if (title.charAt(i+x) == title1.charAt(x)){
System.out.print(title.charAt(0,1));
}
}
}
}
}
I really don't understand what I'm doing wrong here. What I need to do is define a string called "title", with "Book" in it, which I did, and create a second string called "title1". I need to create code to store the contents of title, into title1, but only every other character. For example: title1 should have "Bo" in it. What am I doing wrong?
Here's the looping solution with fewer operations. Instead of checking if i is even, just increment by 2.
String title1 = "Some title";
String title2 = "";
for (int i = 0; i < title1.length(); i += 2)
{
title2 += title1.charAt(i);
}
Your algorithm is wrong. It seems what you need to do is extract every nth character from the source string. For example:
String source = "Book";
The end result should be "Bo".
The algorithm should be:
Iterate through the characters of the original string using whatever stride you need; in this case, a stride of 2 should do (so rather than incrementing by one, increment by the required stride).
Take the character at that index and add it to your second string.
The end result should be a string which holds every nth character.
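The steps above could be sketched roughly like this. The names EveryNth and everyNth are made up for the example, and a StringBuilder is used instead of repeated string concatenation, which is a choice of this sketch and not something the question requires.

```java
class EveryNth {
    // Returns a string made of the characters at indexes 0, stride, 2*stride, ...
    static String everyNth(String source, int stride) {
        if (stride < 1) {
            throw new IllegalArgumentException("stride must be at least 1");
        }
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < source.length(); i += stride) {
            result.append(source.charAt(i));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String title = "Book";
        String title1 = everyNth(title, 2);
        System.out.println(title1); // prints Bo
    }
}
```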
I don't really understand what you are attempting, but I can tell you what you are doing. Your loop structure does the following:
when i = 0, it compares each character of title with the character at the same position in title1 (the inner loop runs x from 0 to title1.length() - 1).
when i = 1, it compares title shifted by one position against title1, so only title.length() - 1 of the comparisons are in range.
when i = 2, only title.length() - 2 of the comparisons are in range.
when i = 3, only title.length() - 3 of the comparisons are in range.
... and so on
System.out.print(title.charAt(0,1)) shouldn't even compile; charAt(int) is the correct call. And if title's length is greater than 0, charAt(0) will always print a single character, the first one in title. It will always be the same unless you reassign title to a different String.
Also, this code will throw an IndexOutOfBoundsException at title.charAt(i+x) as soon as i + x >= title.length(), which first happens when i = 1 and x = title.length() - 1.

Remove chars from string in Java from file

How would I remove the chars from the data in this file so I could sum up the numbers?
Alice Jones,80,90,100,95,75,85,90,100,90,92
Bob Manfred,98,89,87,89,9,98,7,89,98,78
I want to do this so that, for every line, it removes all the chars but keeps the ints.
The following code might be useful to you; try running it once:
public static void main(String ar[])
{
String s = "kasdkasd,1,2,3,4,5,6,7,8,9,10";
int sum=0;
String[] spl = s.split(",");
for(int i=0;i<spl.length;i++)
{
try{
int x = Integer.parseInt(spl[i]);
sum = sum + x;
}
catch(NumberFormatException e)
{
System.out.println("error parsing "+spl[i]);
System.out.println("\n the stack of the exception");
e.printStackTrace();
System.out.println("\n");
}
}
System.out.println("The sum of the numbers in the string : "+ sum);
}
Even a String of the form "abcd,1,2,3,asdas,12,34,asd" would give you the sum of its numbers.
You need to split each line into a String array and parse the numbers starting from index 1:
String[] arr = line.split(",");
for(int i = 1; i < arr.length; i++) {
int n = Integer.parseInt(arr[i]);
...
try this:
String input = "Name,2,1,3,4,5,10,100";
String[] strings = input.split(",");
int result=0;
for (int i = 1; i < strings.length; i++)
{
result += Integer.parseInt(strings[i]);
}
You can of course use the split method, supplying "," as the parameter, but that's not all.
The trick is to put each line of the text file into an ArrayList. Once you have that, follow this pseudocode:
1) Put each line of the text file inside an ArrayList
2) For each line, split it into an array using ","
3) If the array's size is bigger than 1, there are numbers to be summed up; otherwise only the name is in the array and you should continue to the next line
4) If the size is bigger than 1, iterate through the strings in the String[] array generated by split, from index 1 to < size (this excludes the name string itself)
5) Use Integer.parseInt(iterated number as String) and add it to the sum
There you go.
A NumberFormatException would occur if the string is not a number, but you are putting each line into an ArrayList and excluding the name, so there should be no problem :)
Well, if you know that it's a CSV file in this exact format, you could read the line, execute string.split(","), and then disregard the first returned string in the array of results. See Evgenly's answer.
Edit: here's the complete program:
class Foo {
static String input = "Name,2,1,3,4,5,10,100";
public static void main(String[] args) {
String[] strings = input.split(",");
int result=0;
for (int i = 1; i < strings.length; i++)
{
result += Integer.parseInt(strings[i]);
}
System.out.println(result);
}
}
(wow, I never wrote a program before that didn't import anything.)
And here's the output:
125
If you're not interested in parsing the file, but just want to remove the first field, then split the line, disregard the first field, and rejoin the remaining fields.
String[] fields = line.split(",");
StringBuilder sb = new StringBuilder(fields[1]);
for (int i=2; i < fields.length; ++i)
sb.append(',').append(fields[i]);
line = sb.toString();
You could also use a Pattern (regular expression):
line = line.replaceFirst("[^,]*,", "");
Of course, this assumes that the first field contains no commas. If it does, things get more complicated. I assume the commas are escaped somehow.
There are a couple of CsvReader/Writers that might be helpful to you for handling CSV data. Apart from that:
I'm not sure whether you are summing up rows, columns, or both. In any case, create an array of the target sum counters, int[] sums (or just one int sum).
Read one row, then process it either using split (a bit heavy, but clear) or by parsing the line into numbers yourself (likely to generate less garbage and work faster).
Add the numbers to the counters.
Continue until the end of the file.
Loading the whole file before starting to process it is not a good idea, as you are doing 2 bad things:
Stuffing the file into memory; if it's a large file, you'll run out of memory (very bad).
Iterating over the data 2 times instead of once (probably not the end of the world).
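A minimal sketch of that row-by-row approach is shown below, assuming you want one sum per row keyed by the name in the first field. The names RowSums and sumRows are hypothetical, and the method takes any Reader so the example can run on an in-memory string; for a real file you would pass something like a FileReader instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

class RowSums {
    // Reads one line at a time and sums every field after the first (the name).
    static Map<String, Integer> sumRows(Reader source) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                String[] fields = line.split(",");
                int sum = 0;
                for (int i = 1; i < fields.length; i++) {
                    sum += Integer.parseInt(fields[i].trim());
                }
                sums.put(fields[0], sum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sums;
    }

    public static void main(String[] args) {
        String data = "Alice Jones,80,90,100,95,75,85,90,100,90,92\n"
                    + "Bob Manfred,98,89,87,89,9,98,7,89,98,78\n";
        System.out.println(sumRows(new StringReader(data)));
        // prints {Alice Jones=897, Bob Manfred=742}
    }
}
```

Only the current line is held in memory at any time, and the data is traversed once.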
Suppose the format of the string is fixed.
String s = "Alice Jones,80,90,100,95,75,85,90,100,90,92";
First, I would get rid of the non-digit characters:
Matcher matcher = Pattern.compile("(\\d+,)+\\d+").matcher(s);
int sum = 0;
After getting a string of integers separated by commas, I would split it into an array of Strings, parse each one into an integer value, and sum the ints:
if (matcher.find()){
for (String ele: matcher.group(0).split(",")){
sum+= Integer.parseInt(ele);
}
}
System.out.println(sum);

Sudden slow-down and java.lang.OutOfMemoryError during Java string search

I am writing a program for pattern discovery in RNA sequences that mostly works. In order to find 'patterns' in the sequences, I am generating some possible patterns and scanning through the input file of all sequences for them (there's more to the algorithm, but this is the bit that is breaking). Possible patterns generated are of a specified length given by the user.
This works well for all sequence lengths up to 8 characters long. Then at 9, the program runs for a very long time, then gives a java.lang.OutOfMemoryError. After some debugging, I found that the weak point is the pattern generation method:
/* Get elementary pattern (ep) substrings, to later combine into full patterns */
public static void init_ep_subs(int length) {
ep_subs = new ArrayList<Substring>(); // clear static ep_subs data field
/* ep subs are of the form C1...C2...C3 where C1, C2, C3 are characters in the
alphabet and the whole length of the string is equal to the input parameter
'length'. The number of dots varies for different lengths.
The middle character C2 can occur instead of any dot, or not at all.*/
for (int i = 1; i < length-1; i++) { // for each potential position of C2
// for each alphabet character to be C1
for (int first = 0; first < alphabet.length; first++) {
// for each alphabet character to be C3
for (int last = 0; last < alphabet.length; last++) {
// make blank pattern, i.e. no C2
Substring s_blank = new Substring(-1, alphabet[first],
'0', alphabet[last]);
// get its frequency in the input string
s_blank.occurrences = search_sequences(s_blank.toString());
// if blank ep is found frequently enough in the input string, store it
if (s_blank.frequency()>=nP) ep_subs.add(s_blank);
// when C2 is present, for each character it could be
for (int mid = 0; mid < alphabet.length; mid++) {
// make pattern C1,C2,C3
Substring s = new Substring(i, alphabet[first],
alphabet[mid],
alphabet[last]);
// search input string for pattern s
s.occurrences = search_sequences(s.toString());
// if s is frequent enough, store it
if (s.frequency()>=nP) ep_subs.add(s);
}
}
}
}
}
Here's what happens: When I time the calls to search_sequences, they start out at around 40-100ms each and carry on that way for the first patterns. Then after a couple hundred patterns (around 'C.....G.C') those calls suddenly start to take about ten times as long, 1000-2000ms. After that, the times steadily increase until at about 12000ms ('C......TA') it gives this error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:215)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
at java.nio.CharBuffer.toString(CharBuffer.java:1157)
at java.util.regex.Matcher.toMatchResult(Matcher.java:232)
at java.util.Scanner.match(Scanner.java:1270)
at java.util.Scanner.hasNextLine(Scanner.java:1478)
at PatternFinder4.search_sequences(PatternFinder4.java:217)
at PatternFinder4.init_ep_subs(PatternFinder4.java:256)
at PatternFinder4.main(PatternFinder4.java:62)
This is the search_sequences method:
/* Searches the input string 'sequences' for occurrences of the parameter string 'sub' */
public static ArrayList<int[]> search_sequences(String sub) {
/* arraylist returned holding int arrays with coordinates of the places where 'sub'
was found, i.e. {l,i} l = lines number, i = index within line */
ArrayList<int[]> occurrences = new ArrayList<int[]>();
s = new Scanner(sequences);
int line_index = 0;
String line = "";
while (s.hasNextLine()) {
line = s.nextLine();
pattern = Pattern.compile(sub);
matcher = pattern.matcher(line);
pattern = null; // all the =nulls were intended to help memory management, had no effect
int index = 0;
// for each occurrence of 'sub' in the line being scanned
while (matcher.find(index)) {
int start = matcher.start(); // get the index of the next occurrence
int[] occurrence = {line_index, start}; // make up the coordinate array
occurrences.add(occurrence); // store that occurrence
index = start+1; // start looking from after the last occurence found
}
matcher=null;
line=null;
line_index++;
}
s=null;
return occurrences;
}
I've tried the program on a couple of different computers of differing speeds, and while the actual times to complete search_sequences are smaller on faster computers, the relative times are the same; at around the same number of iterations, search_sequences starts taking ten times as long to complete.
I've tried googling about memory efficiency and speed of different input streams such as BufferedReader etc, but the general consensus seems to be that they are all roughly equivalent to Scanner. Do any of you have any advice about what this bug is or how I could try to figure it out myself?
If anyone wants to see any more of the code, just ask.
EDIT:
1 - The input file 'sequences' is 1000 protein sequences (each on one line) of varying lengths around a couple hundred characters. I should also mention this program will /only ever need to work/ up to patterns of length nine.
2 - Here are the Substring class methods used in the above code
static class Substring {
int residue; // position of the middle character C2
char front, mid, end; // alphabet characters for C1, C2 and C3
ArrayList<int[]> occurrences; // list of positions the substring occurs in 'sequences'
String string; // string representation of the substring
public Substring(int inresidue, char infront, char inmid, char inend) {
occurrences = new ArrayList<int[]>();
residue = inresidue;
front = infront;
mid = inmid;
end = inend;
setString(); // makes the string representation using characters and their positions
}
/* gets the frequency of the substring given the places it occurs in 'sequences'.
This only counts the substring /once per line it occurs in/. */
public int frequency() {
return PatternFinder.frequency(occurrences);
}
public String toString() {
return string;
}
/* makes the string representation using the substring's characters and their positions */
private void setString() {
if (residue>-1) {
String left_mid = "";
for (int j = 0; j < residue-1; j++) left_mid += ".";
String right_mid = "";
for (int j = residue+1; j < length-1; j++) right_mid += ".";
string = front + left_mid + mid + right_mid + end;
} else {
String mid = "";
for (int i = 0; i < length-2; i++) mid += ".";
string = front + mid + end;
}
}
}
... and the PatternFinder.frequency method (called in Substring.frequency()) :
public static int frequency(ArrayList<int[]> occurrences) {
HashSet<String> lines_present = new HashSet<String>();
for (int[] occurrence : occurrences) {
lines_present.add(new String(occurrence[0]+""));
}
return lines_present.size();
}
What is alphabet? What kind of regexes are you giving it? Have you checked the number of occurrences you're storing? It's possible that simply storing the occurrences is enough to make it run out of memory, since you're doing an exponential number of searches.
It sounds like your algorithm has a hidden exponential resource usage. You need to rethink what you are trying to do.
Also, setting a local variable to null won't help since the JVM already does data flow and liveness analysis.
Edit: Here's a page that explains how even short regexes can take an exponential amount of time to run.
I can't spot an obvious memory leak, but your program does have a number of inefficiencies. Here are some recommendations:
Indent your code properly. It will make reading it, both for you and for others, much easier. In its current form it's very hard to read.
If you're referring to a member variable, prefix it with this., otherwise readers of code snippets won't know for sure what you're referring to.
Avoid static members and methods unless they're absolutely necessary. When referring to them, use the Classname.membername form, for the same reasons.
How is the code of frequency() different from just return occurrences.size()?
In search_sequences(), the regex string sub is a constant. You need to compile it only once, but you're recompiling it for every line.
Split the input string (sequences) into lines once and store them in an array or ArrayList. Don't re-split inside search_sequences(), pass the split collection in.
There are probably more things to fix, but this is the list that jumps out.
Fix all these and if you still have problems, you may need to use a profiler to find out what's happening.
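Two of the recommendations above (compile the pattern once per search, and pass in lines that were split once up front) might look roughly like the following. This is a sketch under assumptions, not the asker's actual program: the class name SequenceSearch and the methods search and linesWithMatch are invented, and the input is assumed to already be a List<String> with one sequence per element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SequenceSearch {
    // Finds every (possibly overlapping) occurrence of 'sub' as {lineIndex, startIndex}.
    static List<int[]> search(List<String> lines, String sub) {
        Pattern pattern = Pattern.compile(sub); // compiled once, not once per line
        List<int[]> occurrences = new ArrayList<>();
        for (int lineIndex = 0; lineIndex < lines.size(); lineIndex++) {
            String line = lines.get(lineIndex);
            Matcher matcher = pattern.matcher(line);
            int index = 0;
            // Restart one character after each hit so overlapping matches are found.
            while (index <= line.length() && matcher.find(index)) {
                int start = matcher.start();
                occurrences.add(new int[] {lineIndex, start});
                index = start + 1;
            }
        }
        return occurrences;
    }

    // Counts how many distinct lines contain at least one occurrence.
    static int linesWithMatch(List<int[]> occurrences) {
        Set<Integer> linesPresent = new HashSet<>();
        for (int[] occurrence : occurrences) {
            linesPresent.add(occurrence[0]);
        }
        return linesPresent.size();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("CAAGCTTG", "GGG");
        List<int[]> hits = search(lines, "C..G");
        System.out.println(hits.size());          // prints 2
        System.out.println(linesWithMatch(hits)); // prints 1
    }
}
```

Because the lines are passed in, no Scanner is created per search, which removes the repeated re-splitting of the input for each candidate pattern.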
