If I had a .txt file called animals that had fishfroggoat etc. in it, and another file called owners that had something like:
fish:jane
frog:mark
goat:joe
how could I go about pairing the pets to their owners? I'm fairly sure a HashMap would be good here, but I'm stuck. I put the animal text into a string, but I don't know how to break it up into 4-character pieces properly.
Any help would be great.
Sorry I didn't add any code, but thanks to you guys' help (especially Ted Hopps) I worked it out and, more importantly, understood it. :-)
There are various approaches. The most direct is to split it using the substring method:
String animals = "fishfroggoat";
String fish = animals.substring(0, 4);
String frog = animals.substring(4, 8);
String goat = animals.substring(8); // or (8, 12)
If you have an arbitrarily long list of 4-character animals, you can do this:
String animals = "fishfroggoatbear";
int n = animals.length() / 4;
String[] animalArray = new String[n];
for (int i = 0; i < n; ++i) {
animalArray[i] = animals.substring(4*i, 4*i + 4);
}
You can split the pet/owner strings using split:
String rawData = "fish:jane";
String[] data = rawData.split(":");
String pet = data[0];
String owner = data[1];
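To tie the two pieces together, here is a minimal sketch that builds a pet-to-owner map from the "pet:owner" lines and then looks up each 4-character chunk. The class and method names (PetOwners, parseOwners, pair) are made up for illustration, and it assumes every animal name is exactly 4 characters long, as in the question.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PetOwners {
    // Builds a pet -> owner map from lines such as "fish:jane".
    static Map<String, String> parseOwners(List<String> lines) {
        Map<String, String> ownerByPet = new LinkedHashMap<>();
        for (String line : lines) {
            String[] data = line.trim().split(":", 2);
            if (data.length == 2) {              // skip lines without a colon
                ownerByPet.put(data[0], data[1]);
            }
        }
        return ownerByPet;
    }

    // Cuts "fishfroggoat" into 4-character names and pairs each with its owner.
    static Map<String, String> pair(String animals, Map<String, String> ownerByPet) {
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i + 4 <= animals.length(); i += 4) {
            String pet = animals.substring(i, i + 4);
            result.put(pet, ownerByPet.get(pet)); // null if no owner line exists
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> owners =
                parseOwners(Arrays.asList("fish:jane", "frog:mark", "goat:joe"));
        System.out.println(pair("fishfroggoat", owners));
        // prints {fish=jane, frog=mark, goat=joe}
    }
}
```

A LinkedHashMap is used only so the pets print in file order; a plain HashMap works the same way for lookups.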
Use String.split as shown below.
String msg = "fish:jane";
String[] parts = msg.split(":");
This produces an array of the pieces separated by ":", here ["fish", "jane"].
This is how you split a string into 4-character chunks in just one line:
String[] animals = input.split("(?<=\\G....)");
This may seem like black magic, so I'll try to demystify it. Welcome to the dark art of regular expressions...
The String.split() method splits the string on every match to the specified regex. So let's look at the regex:
(?<=\\G....)
The construct (?<=regex) is a "positive look behind" for the regex, meaning that the characters immediately preceding the current position (a look behind is zero-width, so it matches at a point between characters) must match the regex.
The regex \G (coded as \\G in a Java String literal) means "end of the previous match", and initially it matches the start of input.
The regex .... matches any 4 characters.
Thus, when expressed in English, the regex (?<=\\G....) means "after every 4 characters".
If anyone is interested, removing \G and splitting on (?<=....) causes it to split after every character from the 4th onward, because it just means "preceded by 4 characters". You need the \G to require 4 new characters since the last split.
Here's some test code:
public static void main(String[] args) throws Exception {
String input = "fishfroggoatbear";
String[] animals = input.split("(?<=\\G....)");
System.out.println(Arrays.toString(animals));
}
Output:
[fish, frog, goat, bear]
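The same idea can be sketched for any chunk size by replacing .... with a counted quantifier. The helper name chunks is hypothetical, and the sketch assumes the input contains no line terminators, since . does not match them by default.

```java
import java.util.Arrays;

class ChunkSplit {
    // Hypothetical helper: splits text into pieces of `size` characters.
    // A shorter final piece is kept as the last element.
    static String[] chunks(String text, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        // Same regex as above, with .{size} in place of a fixed run of dots.
        return text.split("(?<=\\G.{" + size + "})");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(chunks("fishfroggoatbear", 4)));
        // prints [fish, frog, goat, bear]
    }
}
```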
Related
Say I have a simple sentence as below.
For example, this is what I have:
A simple sentence consists of only one clause. A compound sentence
consists of two or more independent clauses. A complex sentence has at
least one independent clause plus at least one dependent clause. A set
of words with no independent clause may be an incomplete sentence,
also called a sentence fragment.
I want only first 10 words in the sentence above.
I'm trying to produce the following string:
A simple sentence consists of only one clause. A compound
I tried this:
bigString.split(" " ,10).toString()
But it returns the same bigString wrapped with [] array.
Thanks in advance.
Assume bigString holds your text. The first thing to do is split the string into single words.
String[] words = bigString.split(" ");
How many words would you like to extract?
int n = 10;
Put the words together:
String newString = "";
for (int i = 0; i < n && i < words.length; i++) { newString = newString + (i == 0 ? "" : " ") + words[i]; } // no leading space, and no exception if there are fewer than n words
System.out.println(newString);
Hope this is what you needed.
If you want to know more about regular expressions (i.e. to tell java where to split), see here: How to split a string in Java
If you use the split method with a limit (yours is 10), it won't give you the first 10 parts and stop; it gives you the first 9 parts, and the 10th element of the array contains the rest of the input String. Calling toString() on the array does not help either: it does not join the elements, it returns a type-and-hash string starting with [Ljava.lang.String;@. What you can do to achieve what you initially wanted is:
String[] myArray = bigString.split(" ", 11); // up to 10 words, plus the rest of the input as the 11th element
String[] firstTen = Arrays.copyOf(myArray, Math.min(10, myArray.length)); // drop the remainder; needs import java.util.Arrays
String result = String.join(" ", firstTen); // join the words back into a single String
This will help you
String[] strings = Arrays.stream(bigString.split(" "))
.limit(10)
.toArray(String[]::new);
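The stream above yields an array, while the question asks for a single String. A small sketch of that last step, assuming words are separated by any run of whitespace (the method name firstWords is made up for this example):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class FirstWords {
    // Returns the first n whitespace-separated words of text, joined by single spaces.
    // If the text has fewer than n words, all of them are returned.
    static String firstWords(String text, int n) {
        return Arrays.stream(text.trim().split("\\s+"))
                .limit(n)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String bigString = "A simple sentence consists of only one clause. A compound sentence\n"
                + "consists of two or more independent clauses.";
        System.out.println(firstWords(bigString, 10));
        // prints A simple sentence consists of only one clause. A compound
    }
}
```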
Here is exactly what you want:
String[] result = new String[10];
// regex \s matches a whitespace character: [ \t\n\x0B\f\r]
String[] raw = bigString.split("\\s", 11);
// the 11th entry of raw holds the rest of the sentence, so copy only the first 10
System.arraycopy(raw, 0, result, 0, 10);
System.out.println(Arrays.toString(result));
How to split this String in java such that I'll get the text occurring between the braces in a String array?
GivenString = "(1,2,3,4,#) (a,s,3,4,5) (22,324,#$%) (123,3def,f34rf,4fe) (32)"
String [] array = GivenString.split("");
Output must be:
array[0] = "1,2,3,4,#"
array[1] = "a,s,3,4,5"
array[2] = "22,324,#$%"
array[3] = "123,3def,f34rf,4fe"
array[4] = "32"
You can try to use:
Matcher mtc = Pattern.compile("\\((.*?)\\)").matcher(yourString);
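The line above only creates the Matcher. A minimal sketch of the remaining loop, which collects capture group 1 of every match (the class and method names are illustrative, not part of the original answer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenGroups {
    // Lazily matches everything between "(" and the next ")".
    private static final Pattern GROUP = Pattern.compile("\\((.*?)\\)");

    static List<String> extract(String text) {
        List<String> groups = new ArrayList<>();
        Matcher mtc = GROUP.matcher(text);
        while (mtc.find()) {
            groups.add(mtc.group(1)); // text inside the parentheses only
        }
        return groups;
    }

    public static void main(String[] args) {
        String given = "(1,2,3,4,#) (a,s,3,4,5) (22,324,#$%) (123,3def,f34rf,4fe) (32)";
        String[] array = extract(given).toArray(new String[0]);
        for (String s : array) {
            System.out.println(s);
        }
    }
}
```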
The best solution is the answer by Rahul Tripathi, but your question said "How to split", so if you must use split() (e.g. this is an assignment), then this regex will do:
^\s*\(|\)\s*\(|\)\s*$
It says:
Match the open-parenthesis at the beginning
Match close-parenthesis followed by open-parenthesis
Match the close-parenthesis at the end
All 3 allowing whitespace.
As a Java regex, that would mean:
str.split("^\\s*\\(|\\)\\s*\\(|\\)\\s*$")
See regex101 for demo.
The problem with using split() is that the leading open-parenthesis causes a split before the first value, resulting in an empty value at the beginning:
array[0] = ""
array[1] = "1,2,3,4,#"
array[2] = "a,s,3,4,5"
array[3] = "22,324,#$%"
array[4] = "123,3def,f34rf,4fe"
array[5] = "32"
That is why Rahul's answer is better, because it won't see such an empty value.
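If split() really is required, one way to deal with the empty first element is to filter it out afterwards. This is only a sketch (splitGroups is a hypothetical name), and it would also drop a genuinely empty group such as "()":

```java
import java.util.Arrays;

class SplitParens {
    static String[] splitGroups(String text) {
        return Arrays.stream(text.split("^\\s*\\(|\\)\\s*\\(|\\)\\s*$"))
                .filter(s -> !s.isEmpty())   // removes the empty leading element
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String given = "(1,2,3,4,#) (a,s,3,4,5) (22,324,#$%) (123,3def,f34rf,4fe) (32)";
        System.out.println(Arrays.toString(splitGroups(given)));
    }
}
```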
Usually, you would want to use the split() function, as it is the easiest way to split a string into an array of substrings when the string is broken up by a key char.
The main problem is that you need the information in between two chars. The easiest way to solve this is to go through the string and get rid of every instance of '('. This leaves the string looking like
String = "1,2,3,4,#) a,s,3,4,5) 22,324,#$%) 123,3def,f34rf,4fe) 32)"
And this is perfect, as you can split on the char ')' and not worry about the other bracket interfering with the split. I suggest using replace(target, replacement), which replaces every instance of the first parameter with the second parameter (we can use "" as the replacement to delete it).
Here is some example code that may work:
String a = "(1,2,3,4,#) (a,s,3,4,5) (22,324,#$%) (123,3def,f34rf,4fe) (32)";
a = a.replace("(","");
//a is now equal to 1,2,3,4,#) a,s,3,4,5) 22,324,#$%) 123,3def,f34rf,4fe) 32)
String[] parts = a.split("\\)");
System.out.println(parts[0]); //this will print 1,2,3,4,#
I haven't tested it completely. Every part after the first will start with a space (left over from between the groups), which you can get rid of with trim()!
You can then loop through parts[] and it should have all of the required parts for you!
I never understood how to properly write a regex to divide my Strings.
I have Strings of this type: String example = "on[?a, ?b, ?c]";
Sometimes I have this: String example2 = "not clear[?c]";
For the first example I would like to divide it into this:
[on, a, b, c]
or
String name = "on";
String [] vars = [a,b,c];
And for the second example I would like to divide into this type:
[not clear, c]
or
String name = "not clear";
String [] vars = [c];
Thanks a lot in advance, guys ;)
If you know the character set of your identifiers, you can simply do a split on all of the text that isn't in that set. For example, if your identifiers only consist of word characters ([a-zA-Z_0-9]) you can use:
String[] parts = "on[?a, ?b, ?c]".split("[\\W]+");
String name = parts[0];
String[] vars = Arrays.copyOfRange(parts, 1, parts.length);
If your identifiers only have A-Z (upper and lower) you could replace \\W above with ^A-Za-z.
I feel that this is more elegant than using a complex regular expression.
Edit: I realize that this will have issues with your second example "not clear". If you have no option of using something like an underscore instead of a space there, you could do one split on [? (or substring) to get the "name", and another split on the remainder, like so:
String s = "not clear[?a, ?b, ?c]";
String[] parts = s.split("\\[\\?"); //need the '?' so we don't get an extra empty array element in the next split
String name = parts[0];
String[] vars = parts[1].split("[\\W]+");
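As a rough alternative that handles both examples with the same code, the sketch below cuts at the first '[' and then collects each ?variable with a small regex. ParsedTerm and parse are invented names, and it assumes variable names consist of word characters only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParsedTerm {
    // A '?' followed by the variable name (word characters, by assumption).
    private static final Pattern VAR = Pattern.compile("\\?(\\w+)");

    final String name;
    final List<String> vars;

    ParsedTerm(String name, List<String> vars) {
        this.name = name;
        this.vars = vars;
    }

    static ParsedTerm parse(String text) {
        int open = text.indexOf('[');
        int close = text.lastIndexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("expected name[?a, ?b]: " + text);
        }
        String name = text.substring(0, open).trim(); // may contain spaces
        List<String> vars = new ArrayList<>();
        Matcher m = VAR.matcher(text.substring(open + 1, close));
        while (m.find()) {
            vars.add(m.group(1)); // name without the leading '?'
        }
        return new ParsedTerm(name, vars);
    }

    public static void main(String[] args) {
        ParsedTerm t = parse("not clear[?c]");
        System.out.println(t.name + " " + t.vars);
        // prints not clear [c]
    }
}
```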
This comes close, but the problem is the third remembered group is actually repeated so it only captures the last match.
(.*?)\[(?:\s*(?:\?(.*?)(?:\s*,\s*\?(.*?))*)\s*)?]
For example, the first one you list, on[?a, ?b, ?c], would give group 1 as on, group 2 as a and group 3 as c. If you are using Perl, you could use the g flag to apply a regex to a line multiple times, like this (with the input in $line):
my @tokens;
while ( $line =~ /\s*(.*?)\s*[\[,\]]/g ) {
push( @tokens, $1 );
}
Note, I did not actually test the Perl code, it is just off the top of my head. It should give you the idea though.
String[] parts = example.split("[^\\w ]");
List<String> x = new ArrayList<String>();
for (int i = 0; i < parts.length; i++) {
if (!"".equals(parts[i]) && !" ".equals(parts[i])) {
x.add(parts[i]);
}
}
This will work as long as you don't have more than one space separating your non-space characters. There's probably a cleverer way of filtering out the empty and " " strings.
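One possible way to do that filtering, sketched with streams: trim every piece and drop whatever is left empty, which also copes with several spaces in a row. The names TokenFilter and tokens are made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TokenFilter {
    static List<String> tokens(String example) {
        return Arrays.stream(example.split("[^\\w ]"))
                .map(String::trim)            // " " and "  " become ""
                .filter(s -> !s.isEmpty())    // drop the empty pieces
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokens("on[?a, ?b, ?c]"));
        // prints [on, a, b, c]
        System.out.println(tokens("not clear[?c]"));
        // prints [not clear, c]
    }
}
```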
I am working on a short assignment from school and for some reason, two delimiters right next to each other are causing an empty line to print. I would understand this if there was a space between them but there isn't. Am I missing something simple? I would like each token to print one line at a time without printing the ~.
public class SplitExample
{
public static void main(String[] args)
{
String asuURL = "www.public.asu.edu/~JohnSmith/CSE205";
String[] words = new String[6];
words = asuURL.split("[./~]");
for(int i = 0; i < words.length; i++)
{
System.out.println(words[i]);
}
}
}//end SplitExample
Edit: Desired output below
www
public
asu
edu
JohnSmith
CSE205
"I would understand this if there was a space between them but there isn't"
Sure, there is no space between them, but there is an empty string between them. In fact, there is an empty string between every pair of adjacent characters.
That is why, when your delimiters are next to each other, split returns the empty string between them.
"I would like each token to print one line at a time without printing the ~."
Then why have you included / in your delimiters? [./~] means match ., or /, or ~. Splitting on them will also not print the /.
If you just want to split on ~ and ., then just use those in the character class: [.~]. But again, it's not really clear what output you want exactly. Maybe you can post the expected output for your current input.
Seems like you are splitting on ., / and /~. In that case, a character class alone won't do. You can use this pattern to split:
String[] words = asuURL.split("[.]|/~?");
This will split on ., /~ or / (since the ~ is optional).
What do you think the split produces? What's between / and ~ ? Yes, that's right, there is an empty string ("").
There are no characters between /~ so when you split on these characters you should expect to see a blank string.
I suspect you don't need to split on ~ and in fact I wouldn't split on . either.
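To make the empty token visible, and to show one more way around it, here is a small sketch that compares [./~] with [./~]+, which treats a run of adjacent delimiters as a single separator. The class name UrlTokens is made up. A delimiter at the very start of the input would still produce a leading empty token.

```java
import java.util.Arrays;

class UrlTokens {
    // A run of '.', '/' or '~' counts as one delimiter.
    static String[] tokens(String url) {
        return url.split("[./~]+");
    }

    public static void main(String[] args) {
        String asuURL = "www.public.asu.edu/~JohnSmith/CSE205";
        // Single-character delimiters: an empty string appears between '/' and '~'.
        System.out.println(Arrays.toString(asuURL.split("[./~]")));
        // prints [www, public, asu, edu, , JohnSmith, CSE205]
        System.out.println(Arrays.toString(tokens(asuURL)));
        // prints [www, public, asu, edu, JohnSmith, CSE205]
    }
}
```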
Let's say you have a text file like this one:
http://www.gutenberg.org/files/17921/17921-8.txt
Does anyone have a good algorithm, or open-source code, to extract words from a text file?
How can I get all the words while avoiding special characters and keeping things like "it's", etc.?
I'm working in Java.
Thanks
This sounds like the right job for regular expressions. Here is some Java code to give you an idea, in case you don't know how to start:
String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
Pattern p = Pattern.compile("[\\w']+");
Matcher m = p.matcher(input);
while ( m.find() ) {
System.out.println(input.substring(m.start(), m.end()));
}
The pattern [\w']+ matches all word characters, and the apostrophe, multiple times. The example string would be printed word-by-word. Have a look at the Java Pattern class documentation to read more.
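If the words are needed as a collection rather than printed, the same pattern can fill a list. A minimal sketch (WordExtractor and words are hypothetical names); m.group() returns the same text as input.substring(m.start(), m.end()):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractor {
    // Word characters (letters, digits, underscore) and apostrophes.
    private static final Pattern WORD = Pattern.compile("[\\w']+");

    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Well, it's rather short."));
        // prints [Well, it's, rather, short]
    }
}
```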
Here's a good approach to your problem:
This function receives your text as input and returns a list of all the words in it. Note that it treats every character that is not a letter or digit as a separator, so "it's" comes out as "it" and "s".
private ArrayList<String> get_Words(String SInput){
ArrayList<String> all_Words_List = new ArrayList<String>();
StringBuilder SWord = new StringBuilder(); // letters and digits of the word being built
for(int i=0; i<SInput.length(); i++){
char charAt = SInput.charAt(i);
if(Character.isAlphabetic(charAt) || Character.isDigit(charAt)){
SWord.append(charAt);
}
else{
if(SWord.length() > 0) all_Words_List.add(SWord.toString());
SWord.setLength(0); // start a new word
}
}
// flush the last word when the text does not end with a separator
if(SWord.length() > 0) all_Words_List.add(SWord.toString());
return all_Words_List;
}
Pseudocode would look like this:
create words, a list of words, by splitting the input by whitespace
for every word, strip out whitespace and punctuation on the left and the right
The python code would be something like this:
words = input.split()
words = [word.strip(PUNCTUATION) for word in words]
where
PUNCTUATION = ",. \n\t\\\"'][#*:"
or any other characters you want to remove.
I believe Java has an equivalent for the splitting step in the String class: String.split().
Output of running this code on the text you provided in your link:
>>> print words[:100]
['Project', "Gutenberg's", 'Manual', 'of', 'Surgery', 'by', 'Alexis',
'Thomson', 'and', 'Alexander', 'Miles', 'This', 'eBook', 'is', 'for',
'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and',
'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may',
'copy', 'it', 'give', 'it', 'away', 'or', 're-use', 'it', 'under',
... etc etc.
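A rough Java counterpart of the Python snippet above, as a sketch: String.split() covers the first step, and since String has no strip method that takes a set of characters, a regex replace stands in for it here. It uses \p{Punct} (ASCII punctuation) instead of the PUNCTUATION string, which is a substitution on my part, and the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;

class StripWords {
    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            // Remove punctuation from both ends only; "it's" and "re-use" stay intact.
            String word = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Project Gutenberg's Manual, (of) Surgery."));
        // prints [Project, Gutenberg's, Manual, of, Surgery]
    }
}
```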
Basically, you want to match
([A-Za-z])+('([A-Za-z])*)?
right?
You could try regex, using a pattern you've made, and count the number of times that pattern is found.
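Counting matches could look roughly like this. The sketch uses the pattern above with non-capturing groups, so it only recognises unaccented A-Z letters with an optional apostrophe part; WordCount and count are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCount {
    // Letters, optionally followed by an apostrophe and more letters.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+(?:'[A-Za-z]*)?");

    static int count(String text) {
        Matcher m = WORD.matcher(text);
        int n = 0;
        while (m.find()) {
            n++; // one increment per word found
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("Well, it's rather short."));
        // prints 4
    }
}
```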