I have a code to remove duplicate words from a string. Lets say i have:
This is serious serious work. I apply the code and get: This is serious work
This is the code:
return Arrays.stream(input.split(" ")).distinct().collect(Collectors.joining(" "));
Now i want to add new constraints that is if the string/line is longer than 78 characters, break and indent it where it makes sense so the line does not run longer than 78 characters. Example:
This one is a very long line that runs off the right side because it is longer than 78 characters long
It should then be
This one is a very long line that runs off the right side because it is longer
than 78 characters long
I cant find a solution to this. It was brought to my attention that there is a possible duplicate to my question. I cant find my answer there. I need to be able to indent.
You could create a StringBuilder off of the String and then insert a newline and tab at the last word break after 78 characters. You can find the last word break to insert the newline/tab by getting the substring of the first 78 characters, and then finding the index of the last space:
StringBuilder sb = new StringBuilder(Arrays.stream(input.split(" ")).distinct().collect(Collectors.joining(" ")));
if(sb.length() > 78) {
int lastWordBreak = sb.substring(0, 78).lastIndexOf(" ");
sb.insert(lastWordBreak , "\n\t");
}
return sb.toString();
Output:
This one is a very long line that runs off the right side because it longer
than 78 characters
Also your Stream does not do what you want it to. Yes it removes duplicate words but.. it removes duplicate words. So for the String:
This is a great sentence. It is a great example.
It would remove the duplicate is, great and a, and return
This is a great sentence. It example.
To only remove consecutive duplicate words you can look at the following solution:
Removing consecutive duplicates words out of text using Regex and displaying the new text
Alternatively you could create your own them by splitting the text into words, and comparing the current element to the one ahead of it to remove the consecutive duplicate words
Instead of using
Collectors.joining(" ")
it is possible to write a custom collector that adds new lines and indentation at proper places.
Let's introduce a LineWrapper class, which contains indent and limit fields:
public class LineWrapper {
private final int limit;
private final String indent;
The default constructor sets the fields to reasonable default values.
Note how the indent starts with a new line character.
public LineWrapper() {
limit = 78;
indent = "\n ";
}
A custom constructor allows the client to specify limit and indent:
public LineWrapper(int limit, String indent) {
if (limit <= 0) {
throw new IllegalArgumentException("limit");
}
if (indent == null || !indent.matches("\\n *")) {
throw new IllegalArgumentException("indent");
}
this.limit = limit;
this.indent = indent;
}
Following is a regex used to split the input around one or more spaces. This makes sure that the split will not produce empty Strings:
private static final String SPACES = " +";
The apply method splits the input and collects the words into lines of the specified maximum length, indents the lines and removes duplicate consecutive words. Note how duplicates are not removed using the Stream.distinct method, since it also removes duplicates that are not consecutive.
public String apply(String input) {
return Arrays.stream(input.split(SPACES)).collect(toWrappedString());
}
The toWrappedString method returns a collector that accumulates the words in a new ArrayList, and uses the following methods:
addIfDistinct: to add the words to the ArrayList
combine: to merge two array lists
wrap: to split and indent the lines
.
Collector<String, ArrayList<String>, String> toWrappedString() {
return Collector.of(ArrayList::new,
this::addIfDistinct,
this::combine,
this::wrap);
}
The addIfDistinct adds the word to the accumulator ArrayList if it is different than the previous word.
void addIfDistinct(ArrayList<String> accumulator, String word) {
if (!accumulator.isEmpty()) {
String lastWord = accumulator.get(accumulator.size() - 1);
if (!lastWord.equals(word)) {
accumulator.add(word);
}
} else {
accumulator.add(word);
}
}
The combine method adds all words from the second ArrayList to the first one. It also makes sure that the first word of the second ArrayList does not duplicate the last word of the first ArrayList.
ArrayList<String> combine(ArrayList<String> words,
ArrayList<String> moreWords) {
List<String> other = moreWords;
if (!words.isEmpty() && !other.isEmpty()) {
String lastWord = words.get(words.size() - 1);
if (lastWord.equals(other.get(0))) {
other = other.subList(1, other.size());
}
}
words.addAll(other);
return words;
}
Finally the wrap method appends all words to a StringBuffer, inserting the indent when the line length limit is reached:
String wrap(ArrayList<String> words) {
StringBuilder result = new StringBuilder();
if (!words.isEmpty()) {
String firstWord = words.get(0);
result.append(firstWord);
int lineLength = firstWord.length();
for (String word : words.subList(1, words.size())) {
//add 1 to the word length,
//to account for the space character
int len = word.length() + 1;
if (lineLength + len <= limit) {
result.append(' ');
result.append(word);
lineLength += len;
} else {
result.append(indent);
result.append(word);
//subtract 1 from the indent length,
//because the new line does not count
lineLength = indent.length() - 1 + word.length();
}
}
}
return result.toString();
}
Related
I am working on an exercise with the following criteria:
"The input consists of pairs of tokens where each pair begins with the type of ticket that the person bought ("coach", "firstclass", or "discount", case-sensitively) and is followed by the number of miles of the flight."
The list can be paired -- coach 1500 firstclass 2000 discount 900 coach 3500 -- and this currently works great. However, when the String and int value are split like so:
firstclass 5000 coach 1500 coach
100 firstclass
2000 discount 300
it breaks entirely. I am almost certain that it has something to do with me using this format (not full)
while(fileScanner.hasNextLine())
{
StringTokenizer token = new StringTokenizer(fileScanner.nextLine(), " ")
while(token.hasMoreTokens())
{
String ticketClass = token.nextToken().toLowerCase();
int count = Integer.parseInt(token.nextToken());
...
}
}
because it will always read the first value as a String and the second value as an integer. I am very lost on how to keep track of one or the other while going to read the next line. Any help is truly appreciated.
Similar (I think) problems:
Efficient reading/writing of key/value pairs to file in Java
Java-Read pairs of large numbers from file and represent them with linked list, get the sum and product of each pair
Reading multiple values in multiple lines from file (Java)
If you can afford to read the text file in all at once as a very long String, simply use the built-in String.split() with the regex \\s+, like so
String[] tokens = fileAsString.split("\\s+");
This will split the input file into tokens, assuming the tokens are separated by one or more whitespace characters (a whitespace character covers newline, space, tab, and carriage return). Even and odd tokens are ticket types and mile counts, respectively.
If you absolutely have to read in line-by-line and use StringTokenizer, a solution is to count number of tokens in the last line. If this number is odd, the first token in the current line would be of a different type of the first token in the last line. Once knowing the starting type of the current line, simply alternating types from there.
int tokenCount = 0;
boolean startingType = true; // true for String, false for integer
boolean currentType;
while(fileScanner.hasNextLine())
{
StringTokenizer token = new StringTokenizer(fileScanner.nextLine(), " ");
startingType = startingType ^ (tokenCount % 2 == 1); // if tokenCount is odd, the XOR ^ operator will flip the starting type of this line
tokenCount = 0;
while(token.hasMoreTokens())
{
tokenCount++;
currentType = startingType ^ (tokenCount % 2 == 0); // alternating between types in current line
if (currentType) {
String ticketClass = token.nextToken().toLowerCase();
// do something with ticketClass here
} else {
int mileCount = Integer.parseInt(token.nextToken());
// do something with mileCount here
}
...
}
}
I found another way to do this problem without using either the StringTokenizer or the regex...admittedly I had trouble with the regular expressions haha.
I declare these outside of the try-catch block because I want to use them in both my finally statement and return the points:
int points = 0;
ArrayList<String> classNames = new ArrayList<>();
ArrayList<Integer> classTickets = new ArrayList<>();
Then inside my try-statement, I declare the index variable because I won't need that outside of this block. That variable increases each time a new element is read. Odd elements are read as ticket classes and even elements are read as ticket prices:
try
{
int index = 0;
// read till the file is empty
while(fileScanner.hasNext())
{
// first entry is the ticket type
if(index % 2 == 0)
classNames.add(fileScanner.next());
// second entry is the number of points
else
classTickets.add(Integer.parseInt(fileScanner.next()));
index++;
}
}
You can either catch it here like this or use throws NoSuchElementException in your method declaration -- As long as you catch it on your method call
catch(NoSuchElementException noElement)
{
System.out.println("<###-NoSuchElementException-###>");
}
Then down here, loop through the number of elements. See which flight class it is and multiply the ticket count respectively and return the points outside of the block:
finally
{
for(int i = 0; i < classNames.size(); i++)
{
switch(classNames.get(i).toLowerCase())
{
case "firstclass": // 2 points for first
points += 2 * classTickets.get(i);
break;
case "coach": // 1 point for coach
points += classTickets.get(i);
break;
default:
// budget gets nothing
}
}
}
return points;
The regex seems like the most convenient way, but this was more intuitive to me for some reason. Either way, I hope the variety will help out.
simply use the built-in String.split() - #bui
I was finally able to wrap my head around regular expressions, but \s+ was not being recognized for some reason. It kept giving me this error message:
Invalid escape sequence (valid ones are \b \t \n \f \r " ' \ )Java(1610612990)
So when I went through with those characters instead, I was able to write this:
int points = 0, multiplier = 0, tracker = 0;
while(fileScanner.hasNext())
{
String read = fileScanner.next().split(
"[\b \t \n \f \r \" \' \\ ]")[0];
if(tracker % 2 == 0)
{
if(read.toLowerCase().equals("firstclass"))
multiplier = 2;
else if(read.toLowerCase().equals("coach"))
multiplier = 1;
else
multiplier = 0;
}else
{
points += multiplier * Integer.parseInt(read);
}
tracker++;
}
This code goes one entry at a time instead of reading a whole array void of whitespace as a work-around for that error message I was getting. If you could show me what the code would look like with String[] tokens = fileAsString.split("\s+"); instead I would really appreciate it :)
you need to add another "\" before "\s" to escape the slash before "s" itself – #bui
I want to change this binary string "100110001" into "1 00 11 000 1".
I tried finding the answer to that and had no luck finding it. I've tried to approach this problem using split() method.
You can use split() but you need a regex that identifies the correct points to split. Afterward, you can combine the parts again with a space in between:
String input = "100110001";
String result = String. join(" ", input.split("(?<=(.))(?!\\1)"));
System.out.println(result);
Output:
1 00 11 000 1
Edit: The regex simply checks if the current character is not occurring again in the next position. If the character is not occurring back to back we want to split.
It can be done without need to resort to regular expressions by utilizing a plain for loop and StringBuilder in a single pass through the given string, i.e. in O(n) time.
This approach is more simple but a bit more verbose than regex-based solution. The overall performance is almost the same.
The logic:
cut out cases when the given string contains less than two characters;
declare a local variable prev that will store a character at the previous position and initialize it with the first character of the given string;
iterate though the given string and in every case when previous and next characters don't match append an empty space to the result.
The code might look like this:
public static String insertSpaces(String source) {
if (source.length() < 2) { // space can't be inserted
return source;
}
StringBuilder result = new StringBuilder();
char prev = source.charAt(0);
for (int i = 0; i < source.length(); i++) {
char next = source.charAt(i);
if (next != prev) {
result.append(" ");
prev = next;
}
result.append(next);
}
return result.toString();
}
main()
public static void main(String[] args) {
String source = "100110001";
System.out.println(insertSpaces(source));
}
output
1 00 11 000 1
Recently i have come up against a question which i am not able to tackle in school.
I need to remove duplicate words in an input string which consists of words. The main issue here is that the requirement states that i cannot use arrays or regular expressions.
E.g.
userInput = "this is a test testing is fun really fun"
the first "is" is a duplicate of "this" as it is a substring
the second "is" is a duplicate of the first "is"
"testing" is not a duplicate of "test" as it is not an exact match
therefore the output comes out as - "this a test testing fun really"
How would one actually achieve this without using Arrays or Regular Expressions as it is impossible to split the words up by the white spaces and dynamically create a String in java.
I didn't compile this code, but I think it should works.
Let me know if it can help you to solved your problem.
public String solve(String input) {
String ret = "";
int pos = 0;
while(pos<input.length()) {
// find next position of space
int next = input.indexOf(' ',pos);
// space not exists, skip next to end of string
if(next==-1) next = input.length();
// take 1 word from input
String word = input.substring(pos,next);
// check if word exists in previous result
if(ret.indexOf(word)==-1) {
if(ret.length() > 0) ret += " ";
// append word to ret
ret += word;
}
pos = next + 1;
}
return ret;
}
I want to find out if a string that is comma separated contains only the same values:
test,asd,123,test
test,test,test
Here the 2nd string contains only the word "test". I'd like to identify these strings.
As I want to iterate over 100GB, performance matters a lot.
Which might be the fastest way of determining a boolean result if the string contains only one value repeatedly?
public static boolean stringHasOneValue(String string) {
String value = null;
for (split : string.split(",")) {
if (value == null) {
value = split;
} else {
if (!value.equals(split)) return false;
}
}
return true;
}
No need to split the string at all, in fact no need for any string manipulation.
Find the first word (indexOf comma).
Check the remaining string length is an exact multiple of that word+the separating comma. (i.e. length-1 % (foundLength+1)==0)
Loop through the remainder of the string checking the found word against each portion of the string. Just keep two indexes into the same string and move them both through it. Make sure you check the commas too (i.e. bob,bob,bob matches bob,bobabob does not).
As assylias pointed out there is no need to reset the pointers, just let them run through the String and compare the 1st with 2nd, 2nd with 3rd, etc.
Example loop, you will need to tweak the exact position of startPos to point to the first character after the first comma:
for (int i=startPos;i<str.length();i++) {
if (str.charAt(i) != str.charAt(i-startPos)) {
return false;
}
}
return true;
You won't be able to do it much faster than this given the format the incoming data is arriving in but you can do it with a single linear scan. The length check will eliminate a lot of mismatched cases immediately so is a simple optimization.
Calling split might be expensive - especially if it is 200 GB data.
Consider something like below (NOT tested and might require a bit of tweaking the index values, but I think you will get the idea) -
public static boolean stringHasOneValue(String string) {
String seperator = ",";
int firstSeparator = string.indexOf(seperator); //index of the first separator i.e. the comma
String firstValue = string.substring(0, firstSeparator); // first value of the comma separated string
int lengthOfIncrement = firstValue.length() + 1; // the string plus one to accommodate for the comma
for (int i = 0 ; i < string.length(); i += lengthOfIncrement) {
String currentValue = string.substring(i, firstValue.length());
if (!firstValue.equals(currentValue)) {
return false;
}
}
return true;
}
Complexity O(n) - assuming Java implementations of substring is efficient. If not - you can write your own substring method that takes the required no of characters from the String.
for a crack just a line code:
(#Tim answer is more efficient)
System.out.println((new HashSet<String>(Arrays.asList("test,test,test".split(","))).size()==1));
I was asked in interview following question. I could not figure out how to approach this question. Please guide me.
Question: How to know whether a string can be segmented into two strings - like breadbanana is segmentable into bread and banana, while breadbanan is not. You will be given a dictionary which contains all the valid words.
Build a trie of the words you have in the dictionary, which will make searching faster.
Search the tree according to the following letters of your input string. When you've found a word, which is in the tree, recursively start from the position after that word in the input string. If you get to the end of the input string, you've found one possible fragmentation. If you got stuck, come back and recursively try another words.
EDIT: sorry, missed the fact, that there must be just two words.
In this case, limit the recursion depth to 2.
The pseudocode for 2 words would be:
T = trie of words in the dictionary
for every word in T, which can be found going down the tree by choosing the next letter of the input string each time we move to the child:
p <- length(word)
if T contains input_string[p:length(intput_string)]:
return true
return false
Assuming you can go down to a child node in the trie in O(1) (ascii indexes of children), you can find all prefixes of the input string in O(n+p), where p is the number of prefixes, and n the length of the input. Upper bound on this is O(n+m), where m is the number of words in dictionary. Checking for containing will take O(w) where w is the length of word, for which the upper bound would be m, so the time complexity of the algorithm is O(nm), since O(n) is distributed in the first phase between all found words.
But because we can't find more than n words in the first phase, the complexity is also limited to O(n^2).
So the search complexity would be O(n*min(n, m))
Before that you need to build the trie which will take O(s), where s is the sum of lengths of words in the dictionary. The upper bound on this is O(n*m), since the maximum length of every word is n.
you go through your dictionary and compare every term as a substring with the original term e.g. "breadbanana". If the first term matches with the first substring, cut the first term out of the original search term and compare the next dictionary entries with the rest of the original term...
let me try to explain that in java:
e.g.
String dictTerm = "bread";
String original = "breadbanana";
// first part matches
if (dictTerm.equals(original.substring(0, dictTerm.length()))) {
// first part matches, get the rest
String lastPart = original.substring(dictTerm.length());
String nextDictTerm = "banana";
if (nextDictTerm.equals(lastPart)) {
System.out.println("String " + original +
" contains the dictionary terms " +
dictTerm + " and " + lastPart);
}
}
The simplest solution:
Split the string between every pair of consecutive characters and see whether or not both substrings (to the left of the split point and to the right of it) are in the dictionary.
One approach could be:
Put all elements of dictionary in some set or list
now you can use contains & substring function to remove words which matches dictionary. if at the end string is null -> string can be segmented else not. You can also take care of count.
public boolean canBeSegmented(String s) {
for (String word : dictionary.getWords()) {
if (s.contains(word) {
String sub = s.subString(0, s.indexOf(word));
s = sub + s.subString(s.indexOf(word)+word.length(), s.length()-1);
}
return s.equals("");
}
}
This code checks if your given String can be fully segmented. It checks if a word from the dictionary is inside your string and then subtracks it. If you want to segment it in the process you have to order the subtracted sementents in the order they are inside the word.
Just two words makes it easier:
public boolean canBeSegmented(String s) {
boolean wordDetected = false;
for (String word : dictionary.getWords()) {
if (s.contains(word) {
String sub = s.subString(0, s.indexOf(word));
s = sub + s.subString(s.indexOf(word)+word.length(), s.length()-1);
if(!wordDetected)
wordDetected = true;
else
return s.equals("");
}
return false;
}
}
This code checks for one Word and if there is another word in the String and just these two words it returns true otherwise false.
this is a mere idea , you can implement it better if you want
package farzi;
import java.util.ArrayList;
public class StringPossibility {
public static void main(String[] args) {
String str = "breadbanana";
ArrayList<String> dict = new ArrayList<String>();
dict.add("bread");
dict.add("banana");
for(int i=0;i<str.length();i++)
{
String word1 = str.substring(0,i);
String word2 = str.substring(i,str.length());
System.out.println(word1+"===>>>"+word2);
if(dict.contains(word1))
{
System.out.println("word 1 found : "+word1+" at index "+i);
}
if(dict.contains(word2))
{
System.out.println("word 2 found : "+ word2+" at index "+i);
}
}
}
}