How to split a string and extract specific elements? - java

I have a file, which consists of lines such as
20 19:0.26 85:0.36 1064:0.236 # 750
I have been able to read it line by line and output it to the console. However, what I really need is to extract elements like "19:0.26" and "85:0.36" from each line and perform certain operations on them. How can I split the lines and get the elements that I want?

Use a regular expression:
Pattern.compile("\\d+:\\d+\\.\\d+");
Then you can create a Matcher object from this pattern and call its find() method repeatedly to step through the matches.
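A minimal sketch of that approach, assuming each element always has the form digits, colon, decimal number as in the sample line. The class and method names (TokenFinder, findTokens) are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenFinder {
    // Matches tokens such as "19:0.26": digits, a colon, then a decimal number.
    private static final Pattern TOKEN = Pattern.compile("\\d+:\\d+\\.\\d+");

    // Returns every index:value token found in the line, in order of appearance.
    public static List<String> findTokens(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {          // each call advances to the next match
            tokens.add(m.group());  // the text of the current match
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [19:0.26, 85:0.36, 1064:0.236]
        System.out.println(findTokens("20 19:0.26 85:0.36 1064:0.236 # 750"));
    }
}
```

The leading "20" and the trailing "# 750" are skipped because neither contains a colon followed by a decimal number.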

Parsing a line of data depends heavily on what the data is like and how consistent it is. Purely from your example data and the "elements like" that you mention, this could be as easy as
String[] parts = line.split(" ");
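If the goal is to work with the numbers rather than the raw strings, the parts can be split once more on the colon. This is only a sketch under the assumption that every part containing a colon is an integer index followed by a decimal value; PairParser and parsePairs are hypothetical names:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PairParser {
    // Splits the line on whitespace and converts each "index:value" part to numbers.
    public static Map<Integer, Double> parsePairs(String line) {
        Map<Integer, Double> pairs = new LinkedHashMap<>(); // keeps input order
        for (String part : line.trim().split("\\s+")) {
            int colon = part.indexOf(':');
            if (colon < 0) {
                continue; // parts without a colon ("20", "#", "750") are ignored
            }
            int index = Integer.parseInt(part.substring(0, colon));
            double value = Double.parseDouble(part.substring(colon + 1));
            pairs.put(index, value);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints {19=0.26, 85=0.36, 1064=0.236}
        System.out.println(parsePairs("20 19:0.26 85:0.36 1064:0.236 # 750"));
    }
}
```

Splitting on "\\s+" instead of a single space avoids empty parts when the columns are separated by more than one space.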

Adapt this code to your needs:
public class JavaStringSplitExample{
public static void main(String args[]){
String str = "one-two-three";
String[] temp;
/* delimiter */
String delimiter = "-";
/* given string will be split by the argument delimiter provided. */
temp = str.split(delimiter);
/* print substrings */
for(int i =0; i < temp.length ; i++)
System.out.println(temp[i]);
/*
IMPORTANT : Some special characters need to be escaped while providing them as
delimiters like "." and "|".
*/
System.out.println("");
str = "one.two.three";
delimiter = "\\.";
temp = str.split(delimiter);
for(int i =0; i < temp.length ; i++)
System.out.println(temp[i]);
/*
Using second argument in the String.split() method, we can control the maximum
number of substrings generated by splitting a string.
*/
System.out.println("");
temp = str.split(delimiter,2);
for(int i =0; i < temp.length ; i++)
System.out.println(temp[i]);
}
}

Java Strings have a split method that you can call
String [] stringArray = "some string".split(" ");
The argument to split is a regular expression, so you can also split on a pattern rather than on a single fixed character.
String Doc:
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html
Pattern Doc (Used to make regular expressions):
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

Related

2 columns from Pipe delimited 5columns using regex in java

3 columns from Pipe delimited 7columns using regex in java
Example:
String:
10|name|city||date|0|9013
I only want the columns up to city (3 columns):
Expected output:
10|name|city
In other words, I want a given number of columns, counted by the | delimiter, using regex.
Thank you.
I would use a simple regex pattern with the split method. There's probably a more elegant way to handle the pipes in the resulting string, but this should give you an idea. Good luck!
public static void main(String[] args) {
String str = "10|name|city||date|0|9013";
// split the string whenever we see a pipe
String[] arrOfStr = str.split("\\|");
StringBuilder sb = new StringBuilder();
// loop through the array we generated and format our output
// we only want the first three elements so loop accordingly
for (int i = 0; i < 3; i++) {
sb.append(arrOfStr[i]+"|");
}
// remove the trailing pipe
sb.setLength(sb.length() - 1);
System.out.println(sb.toString());
}
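One possible way to avoid appending and then trimming the trailing pipe is to let String.join rebuild the row. This is a sketch, not the answerer's code; FirstColumns and firstColumns are names invented here, and the negative limit is used so that empty trailing columns are not silently dropped:

```java
import java.util.Arrays;

public class FirstColumns {
    // Returns the first `count` pipe-delimited columns of the row, re-joined with pipes.
    public static String firstColumns(String row, int count) {
        // A negative limit keeps trailing empty columns in the array.
        String[] columns = row.split("\\|", -1);
        // Guard against rows that have fewer columns than requested.
        int n = Math.min(count, columns.length);
        return String.join("|", Arrays.copyOfRange(columns, 0, n));
    }

    public static void main(String[] args) {
        // prints 10|name|city
        System.out.println(firstColumns("10|name|city||date|0|9013", 3));
    }
}
```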

Regex to remove special characters in java

I have a string with a couple of special characters and need to remove only a few (~ and `). I have written the code below, but when I print the split strings, I also get empty strings along with the values.
String str = "ABC123-xyz`~`XYZ 1.7A";
String[] str1 = str.split("[\\~`]");
for(int i=0; i< str1.length ; i++){
System.out.println("str==="+ str1[i] );
}
Output:
str===ABC123-xyz
str===
str===
str===XYZ 1.7A
Why are empty strings also printed here?
You're splitting on a single special character. Split on one or more instead:
String[] str1 = str.split("[~`]+");
Note also that the tilde ~ doesn't need escaping.
It's because the .split() method returns a String array of 4 items, shown below:
String[4] { "ABC123-xyz", "", "", "XYZ 1.7A" }
And then in your for loop you are printing all items of that array. You can use the following to skip the empty ones:
for(int i=0; i< str1.length ; i++){
if(!str1[i].isEmpty()) {
System.out.println("str==="+ str1[i] );
}
}
The split method returns the substrings around every match of the regex. Your regex, [~`], matches a single character that is either "~" or "`".
The parts of the string separated by matches of that regex are determined as follows:
The string "ABC123-xyz" is returned because it is split off the given string at the first "`".
Between that character and the next match, "~", there is nothing, so an empty string is returned, and so on.
If you want one match to cover a whole run of these characters, use [~`]+
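To see the difference side by side, here is a small sketch. The class and method names are made up for illustration, and the replaceAll variant is an extra option (not from the answers above) for the case where the characters should simply be removed rather than used as separators:

```java
import java.util.Arrays;

public class TildeBacktick {
    // Splits on each single ~ or `, so adjacent delimiters yield empty strings.
    public static String[] splitOnEach(String s) {
        return s.split("[~`]");
    }

    // Splits on runs of ~ and `, so adjacent delimiters count as one separator.
    public static String[] splitOnRuns(String s) {
        return s.split("[~`]+");
    }

    // Removes every ~ and ` without splitting at all.
    public static String strip(String s) {
        return s.replaceAll("[~`]", "");
    }

    public static void main(String[] args) {
        String str = "ABC123-xyz`~`XYZ 1.7A";
        System.out.println(Arrays.toString(splitOnEach(str))); // [ABC123-xyz, , , XYZ 1.7A]
        System.out.println(Arrays.toString(splitOnRuns(str))); // [ABC123-xyz, XYZ 1.7A]
        System.out.println(strip(str));                        // ABC123-xyzXYZ 1.7A
    }
}
```

Note that even with [~`]+ a leading empty string is still produced when the input starts with one of the delimiters.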

Using regex to split sentence into tokens stripping it of all the necessary punctuation excluding punctuation that is part of a word

I want to split a sentence into separate tokens. However, I don't want to strip punctuation that is part of a token. For example, "didn't" should stay as "didn't". At the end of a word, punctuation that is not followed by a letter should be removed, so "you?" should become "you". The same applies at the beginning: "?you" should become "you".
String str = "..Hello ?don't #$you %know?";
String[] strArray = new String[10];
strArray = str.split("[^A-za-z]+[\\s]|[\\s]");
//strArray[strArray.length-1]
for(int i = 0; i < strArray.length; i++) {
System.out.println(strArray[i] + i);
}
This should just print out:
hello0
don't1
you2
know3
Rather than splitting, it is easier to use find to collect all the tokens you want with this regex:
[a-zA-Z]+(['][a-zA-Z]+)?
This regex allows a single ' sandwiched between letters. To allow other such characters, add them to the character set [']. As written, the group may occur at most once; to allow it multiple times, replace the ? at the end with a * (zero or more times).
Check out your modified Java code:
List<String> tokenList = new ArrayList<String>();
String str = "..Hello ?don't #$you %know?";
Pattern p = Pattern.compile("[a-zA-Z]+(['][a-zA-Z]+)?");
Matcher m = p.matcher(str);
while (m.find()) {
tokenList.add(m.group());
}
String[] strArray = tokenList.toArray(new String[tokenList.size()]);
for (int i = 0; i < strArray.length; i++) {
System.out.println(strArray[i] + i);
}
Prints,
Hello0
don't1
you2
know3
However, if you insist on using the split method, you can split on this regex:
[^a-zA-Z]*\\s+[^a-zA-Z]*|[^a-zA-Z']+
This splits the string either on one or more whitespace characters, optionally surrounded by non-alphabetic characters, or on a sequence of one or more characters that are neither alphabetic nor a single quote. Here is sample Java code using split:
String str = ".. Hello ?don't #$you %know?";
String[] strArray = Arrays.stream(str.split("[^a-zA-Z]*\\s+[^a-zA-Z]*|[^a-zA-Z']+")).filter(x -> x.length()>0).toArray(String[]::new);
for (int i = 0; i < strArray.length; i++) {
System.out.println(strArray[i] + i);
}
Prints,
Hello0
don't1
you2
know3
Note that I have used the filter method on the stream to drop zero-length tokens, since split may produce a zero-length token at the start of the array.

How to match strings split by "|e|" sign

I've written a program to split a string by |o| and |e| signs.
This is my whole string (which I want to process):
code|e|0.07610 |o| p|e|0.02225 |o| li|e|0.02032 |o| applet|e|0.01305 |o| pre|e|0.01289
I wrote a utility function to parse the above string. The following is part of it:
String [] trs = tgs[1].split("[^ |o| ]"); //tgs[1] have the whole string
for (int i=0 ; i<9; i++) {
String t = trs[i].split("[^|e|]")[0];
e.add(new ProbTranslate(t, Double.parseDouble(trs[i].split("[^|e|]")[1])));
}
But it seems to be incorrect (when I debug the program, I get incorrect results). I suspect my mistake is in the regex, so I am looking for a proper regex for parsing the above string.
Any help would be appreciated. Thanks.
To quote special characters in regular expressions, Java provides a method: java.util.regex.Pattern#quote
Applied to your example above, this could look like:
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
final String[] split1 = "code|e|0.07610 |o| p|e|0.02225 |o| li|e|0.02032 |o| applet|e|0.01305 |o| pre|e|0.01289".split(Pattern.quote(" |o| "));
for (int i = 0; i < split1.length; ++i) {
final String name = split1[i];
final String[] split2 = name.split(Pattern.quote("|e|"));
for (int j = 0; j < split2.length; ++j) {
System.out.println(split2[j]);
}
System.out.println("");
}
}
}
Output:
code
0.07610
p
0.02225
li
0.02032
applet
0.01305
pre
0.01289
Solution
Make two changes:
"[^ |o| ]" ➔ "( \\|o\\| )"
"[^|e|]" ➔ "(\\|e\\|)"
With those changes, your code would look like this:
String [] trs = tgs[1].split("( \\|o\\| )");
for (int i=0 ; i<9; i++) {
String t = trs[i].split("(\\|e\\|)")[0];
e.add(new ProbTranslate(t, Double.parseDouble(trs[i].split("(\\|e\\|)")[1])));
}
Explanation
There are three problems with your regex.
String#split(String) splits around the subsequences that match the given regex. Therefore, if you want to split around (and so remove) every |o|, your regex needs to match |o| itself. Your negated character class does the opposite: it matches any single character other than the ones listed. Don't do that.
To match a complete substring, write its characters in sequence. Parentheses, e.g. (substring), only group that sequence into a capture group; they are optional here and do not change what is matched. Brackets, e.g. [characters], form a character class, which means "any one of these individual characters" rather than "this complete substring".
The character | is a metacharacter in regex. To match a literal | rather than denote alternation, you must escape it as \|. And since this is a Java string literal, the backslash itself must be escaped too, because \| is not a valid escape sequence in Java source and will not compile. Hence, \\|.
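As a rough illustration of the points above, the sketch below splits the same token three ways: with a character class, with an escaped literal, and with Pattern.quote. The class and method names are hypothetical, and the unnegated class [|e] is used only to show the "any of these characters" behaviour:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PipeDelimiters {
    // Character class: splits at every single '|' or 'e', not at the sequence "|e|".
    public static String[] splitCharClass(String s) {
        return s.split("[|e]");
    }

    // Escaped literal: splits only where the exact sequence "|e|" occurs.
    public static String[] splitEscaped(String s) {
        return s.split("\\|e\\|");
    }

    // Pattern.quote builds the escaped form for us, with the same result.
    public static String[] splitQuoted(String s) {
        return s.split(Pattern.quote("|e|"));
    }

    public static void main(String[] args) {
        String token = "code|e|0.07610";
        System.out.println(Arrays.toString(splitCharClass(token)));
        System.out.println(Arrays.toString(splitEscaped(token))); // [code, 0.07610]
        System.out.println(Arrays.toString(splitQuoted(token)));  // [code, 0.07610]
    }
}
```

The character-class version also cuts the trailing "e" off "code", which is why it cannot be used for a multi-character delimiter.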

About splits in Java

I'm a Java beginner, so please bear with me if this is an extremely easy answer.
Say I have code that looks like this:
String str;
String [] splits;
str = "The words never line up in such a way ";
splits = str.split(" ");
for (int i = 0; i < splits.length; i++)
System.out.println(splits[i]);
What does Java do at the end of the string? After "way" there is a space; since there is no value after the space does Java decide not to split again?
Thanks so much!
According to the Java documentation for split(), http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String),
split(String r) is equivalent to split(String r, 0), which removes trailing empty strings from the result. Specifically, from the docs:
"Trailing empty strings are therefore not included in the resulting
array."
So the last element in the array after the split will be "way"
You can confirm this by executing the code you mentioned.
The split method does not return an empty string for a delimiter at the end of the string. Example:
class Main
{
public static void main (String[] args)
{
String str;
String [] splits;
str = "The words never line up in such a way "; // trailing delimiter: only an empty string follows it
splits = str.split(" ");
for (int i = 0; i < splits.length; i++)
System.out.println(splits[i]);
System.out.println("END");
}
}
OUTPUT
The
words
never
line
up
in
such
a
way
END
As you can see, no empty string is produced for the trailing delimiter.
Now compare:
class Main
{
public static void main (String[] args)
{
String str;
String [] splits;
str = "The words never line up in such a way yeah";
splits = str.split(" ");
for (int i = 0; i < splits.length; i++)
System.out.println(splits[i]);
System.out.println("END");
}
}
OUTPUT
The
words
never
line
up
in
such
a
way
yeah
END
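Here the string after the last delimiter, "yeah", is not empty, so it is included in the array. An empty string that is not trailing (for example, one produced by two consecutive delimiters in the middle of the string) is also kept in the array.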
I've been looking at the Javadoc, and here is what it says about String.split:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
So this method delegates to the two-argument split, whose limit parameter is documented as follows:
The limit parameter controls the number of times the pattern is
applied and therefore affects the length of the resulting array. If
the limit n is greater than zero then the pattern will be applied at
most n - 1 times, the array's length will be no greater than n, and
the array's last entry will contain all input beyond the last matched
delimiter. If n is non-positive then the pattern will be applied as
many times as possible and the array can have any length. If n is zero
then the pattern will be applied as many times as possible, the array
can have any length, and trailing empty strings will be discarded.
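The effect of the limit can be checked with a short experiment. This is just a sketch; the class name, the helper method, and the comma-separated sample string are chosen here for illustration:

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Thin wrapper around String.split(regex, limit), using a comma as delimiter.
    public static String[] split(String s, int limit) {
        return s.split(",", limit);
    }

    public static void main(String[] args) {
        String s = "a,b,c,,";
        // limit 0 (same as split(",")): trailing empty strings are discarded
        System.out.println(Arrays.toString(split(s, 0)));  // [a, b, c]
        // negative limit: trailing empty strings are kept
        System.out.println(Arrays.toString(split(s, -1))); // [a, b, c, , ]
        // limit 2: pattern applied at most once, the rest stays in the last entry
        System.out.println(Arrays.toString(split(s, 2)));  // [a, b,c,,]
    }
}
```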
thanks
