I want to change this binary string "100110001" into "1 00 11 000 1".
I searched for a solution and had no luck. I've tried to approach this problem using the split() method.
You can use split() but you need a regex that identifies the correct points to split. Afterward, you can combine the parts again with a space in between:
String input = "100110001";
String result = String.join(" ", input.split("(?<=(.))(?!\\1)"));
System.out.println(result);
Output:
1 00 11 000 1
Edit: The regex checks whether the character just before the current position (captured by the lookbehind) occurs again at the next position. If the same character does not follow back to back, we split there.
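To make the split points visible, here is a minimal sketch that prints the array produced by split() before it is joined. The class and method names (SplitRunsDemo, splitRuns) are made up for this illustration.

```java
import java.util.Arrays;

class SplitRunsDemo {
    // Splits the input between any two adjacent characters that differ.
    static String[] splitRuns(String input) {
        // (?<=(.)) captures the previous character in group 1;
        // (?!\\1) requires that the next character is not that same character.
        return input.split("(?<=(.))(?!\\1)");
    }

    public static void main(String[] args) {
        // prints [1, 00, 11, 000, 1]
        System.out.println(Arrays.toString(splitRuns("100110001")));
    }
}
```

The regex also matches at the very end of the string, but split() drops trailing empty strings, so no extra element appears.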
It can be done without resorting to regular expressions, using a plain for loop and a StringBuilder in a single pass through the given string, i.e. in O(n) time.
This approach is simpler but a bit more verbose than the regex-based solution. The overall performance is almost the same.
The logic:
return early when the given string contains fewer than two characters;
declare a local variable prev that stores the character at the previous position, and initialize it with the first character of the given string;
iterate through the given string, and whenever the previous and next characters don't match, append a space to the result.
The code might look like this:
public static String insertSpaces(String source) {
if (source.length() < 2) { // space can't be inserted
return source;
}
StringBuilder result = new StringBuilder();
char prev = source.charAt(0);
for (int i = 0; i < source.length(); i++) {
char next = source.charAt(i);
if (next != prev) {
result.append(" ");
prev = next;
}
result.append(next);
}
return result.toString();
}
main()
public static void main(String[] args) {
String source = "100110001";
System.out.println(insertSpaces(source));
}
output
1 00 11 000 1
I have code that removes duplicate words from a string. Let's say I have:
This is serious serious work.
I apply the code and get:
This is serious work
This is the code:
return Arrays.stream(input.split(" ")).distinct().collect(Collectors.joining(" "));
Now I want to add a new constraint: if the string/line is longer than 78 characters, break and indent it where it makes sense, so that no line runs longer than 78 characters. Example:
This one is a very long line that runs off the right side because it is longer than 78 characters long
It should then be
This one is a very long line that runs off the right side because it is longer
than 78 characters long
I can't find a solution to this. It was brought to my attention that there is a possible duplicate of my question, but I can't find my answer there. I need to be able to indent.
You could create a StringBuilder from the String and then put a newline and tab at the last word break that falls within the first 78 characters. You can find that word break by taking the substring of the first 78 characters and finding the index of its last space. Replacing that space (rather than inserting next to it) keeps the wrapped line from starting with a stray blank:
StringBuilder sb = new StringBuilder(Arrays.stream(input.split(" ")).distinct().collect(Collectors.joining(" ")));
if (sb.length() > 78) {
    int lastWordBreak = sb.substring(0, 78).lastIndexOf(" ");
    if (lastWordBreak >= 0) { // no space means there is nowhere sensible to break
        sb.replace(lastWordBreak, lastWordBreak + 1, "\n\t");
    }
}
return sb.toString();
Output:
This one is a very long line that runs off the right side because it longer
than 78 characters
Also, your Stream does not do what you want it to. Yes, it removes duplicate words, but it removes all of them, not only the consecutive ones. So for the String:
This is a great sentence. It is a great example.
It would remove the duplicate is, great and a, and return
This is a great sentence. It example.
To only remove consecutive duplicate words you can look at the following solution:
Removing consecutive duplicates words out of text using Regex and displaying the new text
Alternatively, you could write your own method that splits the text into words and compares each element to the one next to it, removing only the consecutive duplicate words.
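A minimal sketch of such a hand-written method follows. The names (ConsecutiveDuplicates, removeConsecutiveDuplicates) are invented for illustration, and it assumes words are separated by spaces and compared case-sensitively.

```java
import java.util.ArrayList;
import java.util.List;

class ConsecutiveDuplicates {
    // Keeps a word only if it differs from the word kept immediately before it.
    static String removeConsecutiveDuplicates(String input) {
        List<String> kept = new ArrayList<>();
        for (String word : input.split(" +")) {
            if (kept.isEmpty() || !kept.get(kept.size() - 1).equals(word)) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // prints: This is serious work
        System.out.println(removeConsecutiveDuplicates("This is serious serious work"));
        // prints the sentence unchanged, since no duplicates are adjacent
        System.out.println(removeConsecutiveDuplicates("This is a great sentence. It is a great example."));
    }
}
```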
Instead of using
Collectors.joining(" ")
it is possible to write a custom collector that adds new lines and indentation at proper places.
Let's introduce a LineWrapper class, which contains indent and limit fields:
public class LineWrapper {
private final int limit;
private final String indent;
The default constructor sets the fields to reasonable default values.
Note how the indent starts with a new line character.
public LineWrapper() {
limit = 78;
indent = "\n ";
}
A custom constructor allows the client to specify limit and indent:
public LineWrapper(int limit, String indent) {
if (limit <= 0) {
throw new IllegalArgumentException("limit");
}
if (indent == null || !indent.matches("\\n *")) {
throw new IllegalArgumentException("indent");
}
this.limit = limit;
this.indent = indent;
}
Following is a regex used to split the input around one or more spaces. This makes sure that the split will not produce empty Strings:
private static final String SPACES = " +";
The apply method splits the input and collects the words into lines of the specified maximum length, indents the lines and removes duplicate consecutive words. Note how duplicates are not removed using the Stream.distinct method, since it also removes duplicates that are not consecutive.
public String apply(String input) {
return Arrays.stream(input.split(SPACES)).collect(toWrappedString());
}
The toWrappedString method returns a collector that accumulates the words in a new ArrayList, and uses the following methods:
addIfDistinct: to add the words to the ArrayList
combine: to merge two array lists
wrap: to split and indent the lines
Collector<String, ArrayList<String>, String> toWrappedString() {
return Collector.of(ArrayList::new,
this::addIfDistinct,
this::combine,
this::wrap);
}
The addIfDistinct method adds the word to the accumulator ArrayList if it differs from the previous word.
void addIfDistinct(ArrayList<String> accumulator, String word) {
if (!accumulator.isEmpty()) {
String lastWord = accumulator.get(accumulator.size() - 1);
if (!lastWord.equals(word)) {
accumulator.add(word);
}
} else {
accumulator.add(word);
}
}
The combine method adds all words from the second ArrayList to the first one. It also makes sure that the first word of the second ArrayList does not duplicate the last word of the first ArrayList.
ArrayList<String> combine(ArrayList<String> words,
ArrayList<String> moreWords) {
List<String> other = moreWords;
if (!words.isEmpty() && !other.isEmpty()) {
String lastWord = words.get(words.size() - 1);
if (lastWord.equals(other.get(0))) {
other = other.subList(1, other.size());
}
}
words.addAll(other);
return words;
}
Finally, the wrap method appends all words to a StringBuilder, inserting the indent when the line length limit is reached:
String wrap(ArrayList<String> words) {
StringBuilder result = new StringBuilder();
if (!words.isEmpty()) {
String firstWord = words.get(0);
result.append(firstWord);
int lineLength = firstWord.length();
for (String word : words.subList(1, words.size())) {
//add 1 to the word length,
//to account for the space character
int len = word.length() + 1;
if (lineLength + len <= limit) {
result.append(' ');
result.append(word);
lineLength += len;
} else {
result.append(indent);
result.append(word);
//subtract 1 from the indent length,
//because the new line does not count
lineLength = indent.length() - 1 + word.length();
}
}
}
return result.toString();
}
I'm trying to search and reveal unknown characters in a string. Both strings are of length 12.
Example:
String s1 = "1x11222xx333";
String s2 = "111122223333";
The program should check for all unknowns in s1 represented by x|X and get the relevant chars in s2 and replace the x|X by the relevant char.
So far my code has replaced only the first x|X with the relevant char from s2 but printed duplicates for the rest of the unknowns with the char for the first x|X.
Here is my code:
String VoucherNumber = "1111x22xx333";
String VoucherRecord = "111122223333";
String testVoucher = null;
char x = 'x'|'X';
System.out.println(VoucherNumber); // including unknowns
//find x|X in the string VoucherNumber
for(int i = 0; i < VoucherNumber.length(); i++){
if (VoucherNumber.charAt(i) == x){
testVoucher = VoucherNumber.replace(VoucherNumber.charAt(i), VoucherRecord.charAt(i));
}
}
System.out.println(testVoucher); //after replacing unknowns
}
}
I am always a fan of using StringBuilders, so here's a solution using that:
private static String replaceUnknownChars(String strWithUnknownChars, String fullStr) {
    StringBuilder sb = new StringBuilder(strWithUnknownChars);
    int index; // declared outside the loop: Java does not allow a declaration inside the condition
    // assumes fullStr itself contains no x or X, otherwise the loop would never end
    while ((index = Math.max(sb.indexOf("x"), sb.indexOf("X"))) != -1) {
        sb.setCharAt(index, fullStr.charAt(index));
    }
    return sb.toString();
}
It's quite straightforward. You create a new StringBuilder. While an x or X can still be found in it (the index is not -1), take that index and overwrite the character with setCharAt.
You are using String.replace(char, char) the wrong way. The doc says:
Returns a new string resulting from replacing all occurrences of oldChar in this string with newChar.
So if you have more than one x, this will replace every one of them with the same value.
You need to "change" only the character at a specific spot. The easiest way is to work on the char array that you can get with String.toCharArray; with that array you can keep the same loop logic.
Of course, you can also use String.indexOf to find the index of a specific character.
Note: char c = 'x'|'X'; will not give you the expected result. The | operator is a bitwise OR, so it produces one single value, not "either character".
The OR returns 1 for every bit position where at least one of the operands has a 1:
0111 1000 (x)
0101 1000 (X)
OR
0111 1000 (x)
So the result is just 'x', and an uppercase X is never matched. The expression itself has type int (binary numeric operations are carried out in at least int precision); the assignment to a char compiles only because the operands are compile-time constants and the result fits in a char.
You have two solutions here: either use two variables (or an array), or, if you can, use String.toLowerCase and compare only against char c = 'x'.
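The char-array approach can be sketched as follows. The class and method names (VoucherDemo, revealUnknowns) are invented for this example, and it assumes both strings have the same length, as stated in the question.

```java
class VoucherDemo {
    // Replaces every 'x' or 'X' in masked with the character at the same index in full.
    static String revealUnknowns(String masked, String full) {
        if (masked.length() != full.length()) {
            throw new IllegalArgumentException("both strings must have the same length");
        }
        char[] chars = masked.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] == 'x' || chars[i] == 'X') { // two explicit comparisons, no bitwise OR
                chars[i] = full.charAt(i);
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // prints 111122223333
        System.out.println(revealUnknowns("1x11222xx333", "111122223333"));
    }
}
```

Because each position is written exactly once, this version terminates even if the full string happens to contain an x itself.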
I want to find out if a string that is comma separated contains only the same values:
test,asd,123,test
test,test,test
Here the 2nd string contains only the word "test". I'd like to identify these strings.
As I want to iterate over 100GB, performance matters a lot.
Which might be the fastest way of determining a boolean result if the string contains only one value repeatedly?
public static boolean stringHasOneValue(String string) {
String value = null;
for (String split : string.split(",")) {
if (value == null) {
value = split;
} else {
if (!value.equals(split)) return false;
}
}
return true;
}
No need to split the string at all, in fact no need for any string manipulation.
Find the first word (indexOf comma).
Check that the remaining string length is an exact multiple of that word plus the separating comma (i.e. (length - foundLength) % (foundLength + 1) == 0).
Loop through the remainder of the string checking the found word against each portion of the string. Just keep two indexes into the same string and move them both through it. Make sure you check the commas too (i.e. bob,bob,bob matches bob,bobabob does not).
As assylias pointed out there is no need to reset the pointers, just let them run through the String and compare the 1st with 2nd, 2nd with 3rd, etc.
Example loop, you will need to tweak the exact position of startPos to point to the first character after the first comma:
for (int i=startPos;i<str.length();i++) {
if (str.charAt(i) != str.charAt(i-startPos)) {
return false;
}
}
return true;
You won't be able to do it much faster than this given the format the incoming data is arriving in but you can do it with a single linear scan. The length check will eliminate a lot of mismatched cases immediately so is a simple optimization.
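Put together, the steps above might look like the following sketch. The names (SameValueCheck, hasOneValue) are made up here, and treating a string without any comma as a single value is an assumption, not something the question specifies.

```java
class SameValueCheck {
    // True if every comma-separated value in str equals the first one.
    static boolean hasOneValue(String str) {
        int firstComma = str.indexOf(',');
        if (firstComma < 0) {
            return true; // assumption: a string without commas holds one value
        }
        int startPos = firstComma + 1; // first character after the first comma
        // The whole string plus one virtual trailing comma must consist of
        // equally sized "word," blocks.
        if ((str.length() + 1) % startPos != 0) {
            return false;
        }
        // Compare each character with the one a block earlier; commas are compared too.
        for (int i = startPos; i < str.length(); i++) {
            if (str.charAt(i) != str.charAt(i - startPos)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasOneValue("test,asd,123,test")); // prints false
        System.out.println(hasOneValue("test,test,test"));    // prints true
    }
}
```

No substrings or arrays are allocated, which matters when the check runs over a very large input.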
Calling split might be expensive, especially with 100 GB of data.
Consider something like below (NOT tested and might require a bit of tweaking the index values, but I think you will get the idea) -
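public static boolean stringHasOneValue(String string) {
    String separator = ",";
    int firstSeparator = string.indexOf(separator); // index of the first separator, i.e. the comma
    if (firstSeparator < 0) {
        return true; // no separator at all: the string holds a single value
    }
    String firstValue = string.substring(0, firstSeparator); // first value of the comma-separated string
    int lengthOfIncrement = firstValue.length() + 1; // the value plus one to account for the comma
    if ((string.length() + 1) % lengthOfIncrement != 0) {
        return false; // the remaining values cannot all have the length of the first one
    }
    for (int i = lengthOfIncrement; i < string.length(); i += lengthOfIncrement) {
        String currentValue = string.substring(i, i + firstValue.length());
        // each value must be preceded by the separator and equal the first value
        if (!string.startsWith(separator, i - 1) || !firstValue.equals(currentValue)) {
            return false;
        }
    }
    return true;
}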
Complexity is O(n), assuming Java's implementation of substring is efficient. If it is not, you can write your own substring method that takes the required number of characters from the String.
Just for fun, a one-liner
(@Tim's answer is more efficient):
System.out.println((new HashSet<String>(Arrays.asList("test,test,test".split(","))).size()==1));
I am writing a program for pattern discovery in RNA sequences that mostly works. In order to find 'patterns' in the sequences, I am generating some possible patterns and scanning through the input file of all sequences for them (there's more to the algorithm, but this is the bit that is breaking). Possible patterns generated are of a specified length given by the user.
This works well for all sequence lengths up to 8 characters long. Then at 9, the program runs for a very long time, then gives a java.lang.OutOfMemoryError. After some debugging, I found that the weak point is the pattern generation method:
/* Get elementary pattern (ep) substrings, to later combine into full patterns */
public static void init_ep_subs(int length) {
ep_subs = new ArrayList<Substring>(); // clear static ep_subs data field
/* ep subs are of the form C1...C2...C3 where C1, C2, C3 are characters in the
alphabet and the whole length of the string is equal to the input parameter
'length'. The number of dots varies for different lengths.
The middle character C2 can occur instead of any dot, or not at all.*/
for (int i = 1; i < length-1; i++) { // for each potential position of C2
// for each alphabet character to be C1
for (int first = 0; first < alphabet.length; first++) {
// for each alphabet character to be C3
for (int last = 0; last < alphabet.length; last++) {
// make blank pattern, i.e. no C2
Substring s_blank = new Substring(-1, alphabet[first],
'0', alphabet[last]);
// get its frequency in the input string
s_blank.occurrences = search_sequences(s_blank.toString());
// if blank ep is found frequently enough in the input string, store it
if (s_blank.frequency()>=nP) ep_subs.add(s_blank);
// when C2 is present, for each character it could be
for (int mid = 0; mid < alphabet.length; mid++) {
// make pattern C1,C2,C3
Substring s = new Substring(i, alphabet[first],
alphabet[mid],
alphabet[last]);
// search input string for pattern s
s.occurrences = search_sequences(s.toString());
// if s is frequent enough, store it
if (s.frequency()>=nP) ep_subs.add(s);
}
}
}
}
}
Here's what happens: When I time the calls to search_sequences, they start out at around 40-100ms each and carry on that way for the first patterns. Then after a couple hundred patterns (around 'C.....G.C') those calls suddenly start to take about ten times as long, 1000-2000ms. After that, the times steadily increase until at about 12000ms ('C......TA') it gives this error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:215)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
at java.nio.CharBuffer.toString(CharBuffer.java:1157)
at java.util.regex.Matcher.toMatchResult(Matcher.java:232)
at java.util.Scanner.match(Scanner.java:1270)
at java.util.Scanner.hasNextLine(Scanner.java:1478)
at PatternFinder4.search_sequences(PatternFinder4.java:217)
at PatternFinder4.init_ep_subs(PatternFinder4.java:256)
at PatternFinder4.main(PatternFinder4.java:62)
This is the search_sequences method:
/* Searches the input string 'sequences' for occurrences of the parameter string 'sub' */
public static ArrayList<int[]> search_sequences(String sub) {
/* arraylist returned holding int arrays with coordinates of the places where 'sub'
was found, i.e. {l,i} l = lines number, i = index within line */
ArrayList<int[]> occurrences = new ArrayList<int[]>();
s = new Scanner(sequences);
int line_index = 0;
String line = "";
while (s.hasNextLine()) {
line = s.nextLine();
pattern = Pattern.compile(sub);
matcher = pattern.matcher(line);
pattern = null; // all the =nulls were intended to help memory management, had no effect
int index = 0;
// for each occurrence of 'sub' in the line being scanned
while (matcher.find(index)) {
int start = matcher.start(); // get the index of the next occurrence
int[] occurrence = {line_index, start}; // make up the coordinate array
occurrences.add(occurrence); // store that occurrence
index = start+1; // start looking from after the last occurrence found
}
matcher=null;
line=null;
line_index++;
}
s=null;
return occurrences;
}
I've tried the program on a couple of different computers of differing speeds, and while the actual times to complete search_sequences are smaller on faster computers, the relative times are the same; at around the same number of iterations, search_sequences starts taking ten times as long to complete.
I've tried googling about memory efficiency and speed of different input streams such as BufferedReader etc, but the general consensus seems to be that they are all roughly equivalent to Scanner. Do any of you have any advice about what this bug is or how I could try to figure it out myself?
If anyone wants to see any more of the code, just ask.
EDIT:
1 - The input file 'sequences' is 1000 protein sequences (each on one line) of varying lengths around a couple hundred characters. I should also mention this program will /only ever need to work/ up to patterns of length nine.
2 - Here are the Substring class methods used in the above code
static class Substring {
int residue; // position of the middle character C2
char front, mid, end; // alphabet characters for C1, C2 and C3
ArrayList<int[]> occurrences; // list of positions the substring occurs in 'sequences'
String string; // string representation of the substring
public Substring(int inresidue, char infront, char inmid, char inend) {
occurrences = new ArrayList<int[]>();
residue = inresidue;
front = infront;
mid = inmid;
end = inend;
setString(); // makes the string representation using characters and their positions
}
/* gets the frequency of the substring given the places it occurs in 'sequences'.
This only counts the substring /once per line it occurs in/. */
public int frequency() {
return PatternFinder.frequency(occurrences);
}
public String toString() {
return string;
}
/* makes the string representation using the substring's characters and their positions */
private void setString() {
if (residue>-1) {
String left_mid = "";
for (int j = 0; j < residue-1; j++) left_mid += ".";
String right_mid = "";
for (int j = residue+1; j < length-1; j++) right_mid += ".";
string = front + left_mid + mid + right_mid + end;
} else {
String mid = "";
for (int i = 0; i < length-2; i++) mid += ".";
string = front + mid + end;
}
}
}
... and the PatternFinder.frequency method (called in Substring.frequency()) :
public static int frequency(ArrayList<int[]> occurrences) {
HashSet<String> lines_present = new HashSet<String>();
for (int[] occurrence : occurrences) {
lines_present.add(new String(occurrence[0]+""));
}
return lines_present.size();
}
What is alphabet? What kind of regexes are you giving it? Have you checked the number of occurrences you're storing? It's possible that simply storing the occurrences is enough to make it run out of memory, since you're doing an exponential number of searches.
It sounds like your algorithm has a hidden exponential resource usage. You need to rethink what you are trying to do.
Also, setting a local variable to null won't help since the JVM already does data flow and liveness analysis.
Edit: Here's a page that explains how even short regexes can take an exponential amount of time to run.
I can't spot an obvious memory leak, but your program does have a number of inefficiencies. Here are some recommendations:
Indent your code properly. It will make reading it, both for you and for others, much easier. In its current form it's very hard to read.
If you're referring to a member variable, prefix it with this., otherwise readers of code snippets won't know for sure what you're referring to.
Avoid static members and methods unless they're absolutely necessary. When referring to them, use the Classname.membername form, for the same reasons.
How is the code of frequency() different from just return occurrences.size()?
In search_sequences(), the regex string sub is a constant. You need to compile it only once, but you're recompiling it for every line.
Split the input string (sequences) into lines once and store them in an array or ArrayList. Don't re-split inside search_sequences(), pass the split collection in.
There are probably more things to fix, but this is the list that jumps out.
Fix all these and if you still have problems, you may need to use a profiler to find out what's happening.
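As a rough illustration of the last two recommendations (compile the regex once, split the input into lines once), here is a minimal sketch. The class name SequenceSearch, the method signature, and the sample data are assumptions for this example and not the asker's actual code; it assumes a non-empty pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SequenceSearch {
    // Returns {lineIndex, startIndex} pairs for every (possibly overlapping) match of sub.
    static List<int[]> search(List<String> lines, String sub) {
        Pattern pattern = Pattern.compile(sub); // compiled once per pattern, not once per line
        List<int[]> occurrences = new ArrayList<>();
        for (int lineIndex = 0; lineIndex < lines.size(); lineIndex++) {
            String line = lines.get(lineIndex);
            Matcher matcher = pattern.matcher(line);
            int from = 0;
            // find(int) resets the matcher and searches from the given index
            while (from <= line.length() && matcher.find(from)) {
                occurrences.add(new int[] {lineIndex, matcher.start()});
                from = matcher.start() + 1; // allow overlapping occurrences
            }
        }
        return occurrences;
    }

    public static void main(String[] args) {
        // The input is split into lines a single time and reused for every pattern.
        List<String> lines = Arrays.asList("AAAA\nCCC\nABA".split("\n"));
        for (int[] hit : search(lines, "A.A")) {
            System.out.println(hit[0] + ":" + hit[1]);
        }
    }
}
```

This removes the repeated Scanner construction that shows up in the stack trace, though it does not by itself reduce the number of stored occurrences.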
In Java, I need to read lines of text from a file and then reverse each line, writing the reversed version into another file. I know how to read from one file and write to another. What I don't know how to do is manipulate the text so that "This is line 1" would be written into the second file as "1 enil si sihT".
Since this is homework, you are probably interested in writing your own implementation of a reverse method.
The naive version visits the string backwards (from the last index to the index 0) while copying it in a StringBuilder:
public String reverse(String s) {
StringBuilder sb = new StringBuilder();
for (int i = s.length() - 1; i >= 0; i--) {
sb.append(s.charAt(i));
}
return sb.toString();
}
For example, with the String "hello":
h e l l o
0 1 2 3 4 // indexes for charAt()
the method starts at index 4 ('o'), then index 3 ('l'), and so on down to index 0 ('h').
StringBuilder buffer = new StringBuilder(theString);
return buffer.reverse().toString();
If this is homework, it would be better for you to understand how the data is stored in the string itself.
A string may be represented as an array of characters
String line = // read line ....;
char [] data = line.toCharArray();
To reverse an array you have to swap the positions of the elements. The first in the last, the last in the first and so on.
int l = data.length;
char temp;
temp = data[0]; // put the first element in "temp" to avoid losing it.
data[0] = data[l - 1]; // put the last value in the first;
data[l - 1] = temp; // and the first in the last.
Continue with the rest of the elements ( hint use a loop ) in the array and then create a new String with the result:
String modifiedString = new String( data ); // where data is the reversed array.
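For completeness, one way the loop hinted at above could look is sketched below. The names (ReverseInPlace, reverse) are illustrative only; the idea is simply to move two indexes toward each other and swap as they go.

```java
class ReverseInPlace {
    // Reverses a line by swapping characters from both ends toward the middle.
    static String reverse(String line) {
        char[] data = line.toCharArray();
        for (int left = 0, right = data.length - 1; left < right; left++, right--) {
            char temp = data[left];   // keep the left element so it is not lost
            data[left] = data[right]; // put the right value on the left
            data[right] = temp;       // and the saved left value on the right
        }
        return new String(data);
    }

    public static void main(String[] args) {
        // prints 1 enil si sihT
        System.out.println(reverse("This is line 1"));
    }
}
```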
If it is not homework (and you really just need to have the work done), use:
StringBuilder.reverse()
Good luck.
String reversed = new StringBuilder(textLine).reverse().toString();
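Applied to the whole task from the question, a minimal sketch might look like this. It assumes UTF-8 text files small enough to read into memory, and the class and method names (ReverseLines, reverseFile) as well as the command-line arguments are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class ReverseLines {
    // Reads every line of the input file and writes its reversed form to the output file.
    static void reverseFile(Path in, Path out) throws IOException {
        List<String> reversed = new ArrayList<>();
        for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
            reversed.add(new StringBuilder(line).reverse().toString());
        }
        Files.write(out, reversed, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // usage (hypothetical): java ReverseLines input.txt output.txt
        reverseFile(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```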
The provided answers all suggest using an already existing method, which is sound advice and usually more effective than writing your own.
Depending on the assignment, however, your teacher might expect you to write a method of your own. If that is the case, try using a for loop to walk through the string character by character, only instead of counting from zero and up, start counting from the last character index and down to zero, consecutively building the reversed string.
While we're feeding horrible, finished answers to the poor student, we might as well whet his appetite for the bizarre. If strings were guaranteed to be reasonably short and CPU time was no object, this is what I'd code:
public static String reverse(String str) {
if (str.length() == 0) return "";
else return reverse(str.substring(1)) + str.charAt(0);
}
(OK, I admit it: my current favorite language is Clojure, a Lisp!)
BONUS HOMEWORK: Figure out if, how and why this works!
java.lang.StringBuffer has a reverse method.