Apply regex on captured group - java

I'm new to Java, and to regex in particular.
I have a CSV file that looks something like this:
col1,col2,clo3,col4
word1,date1,date2,port1,port2,....some amount of port
word2,date3,date4,
....
What I would like is to iterate over each line (I suppose I'll do it with a simple for loop) and get all the ports back.
I guess what I need is to fetch everything after the two dates, look for
,(\d+),? and take the group that comes back.
My questions are:
1) Can it be done with one expression? (That is, without storing the result in a string and then applying another regex.)
2) Can I maybe incorporate the iteration over the lines into the regex?

There are many ways to do it; I will show a few for educational purposes.
I put your input in a String just for the example; you will have to read it from the file properly. I also store the results in a List and print them at the end:
public static void main(String[] args) {
    String source = "col1,col2,clo3,col4" + System.lineSeparator() +
            "word1,date1,date2,port1,port2,port3" + System.lineSeparator() +
            "word2,date3,date4";
    List<String> ports = new ArrayList<>();
    // insert code blocks below
    System.out.println(ports);
}
Using Scanner:
Scanner scanner = new Scanner(source);
scanner.useDelimiter("\\s|,");
while (scanner.hasNext()) {
    String token = scanner.next();
    if (token.startsWith("port"))
        ports.add(token);
}
Using String.split:
String[] values = source.split("\\s|,");
for (String value : values) {
    if (value.startsWith("port"))
        ports.add(value);
}
Using Pattern-Matcher:
Matcher matcher = Pattern.compile("(port\\d+)").matcher(source);
while (matcher.find()) {
    ports.add(matcher.group());
}
Output:
[port1, port2, port3]
If you know where the ports are located in each line, you can use that to slightly improve performance by taking a substring from that position and processing only that part.
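As a rough sketch of that last idea, assuming (as in the question) that every data line starts with one word and two dates, the ports are simply the fields from index 3 onward. The class name PortColumns and the helper portsOf are made up for this illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PortColumns {
    // Assumption: columns 0..2 are word, date, date; every later column is a port.
    private static final int FIRST_PORT_COLUMN = 3;

    static List<String> portsOf(String line) {
        // split(",") drops trailing empty fields, so "word2,date3,date4," yields 3 fields
        String[] fields = line.split(",");
        if (fields.length <= FIRST_PORT_COLUMN) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(fields).subList(FIRST_PORT_COLUMN, fields.length));
    }

    public static void main(String[] args) {
        String source = "col1,col2,clo3,col4\n"
                + "word1,date1,date2,port1,port2,port3\n"
                + "word2,date3,date4,";
        String[] lines = source.split("\\R"); // \R matches any line break (Java 8+)
        List<String> ports = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) { // start at 1 to skip the header line
            ports.addAll(portsOf(lines[i]));
        }
        System.out.println(ports); // prints [port1, port2, port3]
    }
}
```

This avoids a second regex altogether, at the cost of relying on the column layout staying fixed.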

Yes, it can be done in one line:
first remove all non-port terms (those containing a non-digit)
then split the result of step one on commas
Here's the magic line:
String[] ports = line.replaceAll("(^|(?<=,))[^,]*[^,\\d][^,]*(,|$)", "").split(",");
The regex says "any term that has a non-digit" where a "term" is a series of characters between start-of-input/comma and comma/end-of-input.
Conveniently, the split() method doesn't return trailing blank terms, so there is no need to worry about any trailing commas left after the first replace.
In Java 8 you can also do it in one line, and things are much more straightforward:
List<String> ports = Arrays.stream(line.split(",")).filter(s -> s.matches("\\d+")).collect(Collectors.toList());
This streams the result of a split on commas, filters out every element that is not all digits, then collects the result into a list.
Some test code:
String line = "foo,12-12-12,11111,2222,bar,3333";
String[] ports = line.replaceAll("(^|(?<=,))[^,]*[^,\\d][^,]*(,|$)", "").split(",");
System.out.println(Arrays.toString(ports));
Output:
[11111, 2222, 3333]
Same output in Java 8 for:
String line = "foo,12-12-12,11111,2222,bar,3333,baz";
List<String> ports = Arrays.stream(line.split(",")).filter(s -> s.matches("\\d+")).collect(Collectors.toList());

Related

Split a string of multiple sentences into single sentences and surround them with html tags

I am a Java beginner and am currently looking for a way to split a String message into substrings based on a delimiter ( . ). Ideally I then have single sentences, and I want to wrap each sentence in HTML tags, i.e. <p></p>.
I tried the following with BreakIterator class:
BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
List<String> sentences = new ArrayList<String>();
iterator.setText(message);
int start = iterator.first();
String newMessage= "";
for (int end = iterator.next();
        end != BreakIterator.DONE;
        start = end, end = iterator.next()) {
    newMessage= "<p>"+ message.substring(start,end) + "</p>";
    sentences.add(newMessage);
}
This gives back only one sentence. I am stuck here. I also want to wrap each number in each sentence in a <number> tag.
The String I have contains something like:
String message = "Hello, John. My phone number is: 02365897458.
Please call me tomorrow morning, at 8 am."
The output should be:
String newMessage = "<p>Hello, John.</p><p>My phone number is:
<number>02365897458</number>.
</p><p>Please call me tomorrow morning, at 8 am.</p>"
Is there a possibility to achieve this?
Try the split method on Java's String. It takes a regex, so to split on a literal . you have to escape it (split("\\.")); it returns an array of Strings.
https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#split-java.lang.String-
This can easily be done using the StringTokenizer class, along with the StringBuilder class:
String message = SOME_STRING;
StringBuilder builder = new StringBuilder();
StringTokenizer tokenizer = new StringTokenizer(message, ".");
while(tokenizer.hasMoreTokens()) {
    builder.append("<p>");
    builder.append(tokenizer.nextToken());
    builder.append("</p>");
}
return builder.toString();
You can add more delimiters as required for various tags.
Surrounding the sentences could be achieved by adding a <p> at the start and a </p> at the end, and replacing each full stop with .</p><p>. Take a look at the replace method for strings.
And to add the number tag, you could use a regex replace. The replaceAll method and a regex like [0-9]+, depending on what your numbers look like, can do that.
Something similar to this should work (untested; if the message ends with a full stop, it leaves an empty <p></p> at the end):
newMessage = "<p>" + message.replace(".", ".</p><p>")
        .replaceAll("([0-9]+)", "<number>$1</number>") +
        "</p>";
As said above, you can use the split method. Because you're splitting on dots, be sure to escape the dot in your regex. A simple example (there are other ways to keep the delimiter, but I've done it like this for simplicity as you're a beginner):
String toSplit = "Hello, John. My phone number is: 02365897458. Please call me tomorrow morning, at 8 am.";
String[] tokens = toSplit.split("\\.");
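for (int i = 0; i < tokens.length; i++) {
    // Write back into the array; assigning to a for-each variable would change nothing.
    // trim() removes the space that followed the previous full stop.
    tokens[i] = "<p>" + tokens[i].trim() + ".</p>";
}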
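Putting the two parts of the question together, here is a minimal sketch that wraps sentences and numbers in one pass. It rests on two assumptions that are not from the question: a sentence ends at a full stop followed by whitespace, and a "number" is a run of two or more digits (so that the 8 in "8 am" stays untagged, as in the expected output). The class name SentenceWrapper is made up for the example.

```java
import java.util.regex.Pattern;

final class SentenceWrapper {
    // Assumption: a sentence ends at a full stop that is followed by whitespace.
    private static final Pattern SENTENCE_END = Pattern.compile("(?<=\\.)\\s+");
    // Assumption: only runs of two or more digits count as a number.
    private static final Pattern NUMBER = Pattern.compile("\\d{2,}");

    static String wrap(String message) {
        StringBuilder out = new StringBuilder();
        for (String sentence : SENTENCE_END.split(message.trim())) {
            // $0 refers to the whole match, i.e. the digits themselves
            String tagged = NUMBER.matcher(sentence).replaceAll("<number>$0</number>");
            out.append("<p>").append(tagged).append("</p>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String message = "Hello, John. My phone number is: 02365897458. "
                + "Please call me tomorrow morning, at 8 am.";
        System.out.println(wrap(message));
    }
}
```

The lookbehind keeps the full stop inside each sentence, so nothing has to be re-appended afterwards. Abbreviations such as "e.g. this" would still be split; BreakIterator is the better tool if that matters.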

multiple sections in a csv row

I have a CSV file formatted like this:
<F,Bird,20,10/> < A,Fish,5,11,2/>
I was wondering how to read in those values separately.
Would I have to read the whole line into an array?
I have thought of doing line.split("/>"), but then the first record would still have < in it, which I don't want.
If, on the other hand, I just separate it using line.split(",") and then assign each value accordingly, the values in the middle would merge, so that does not work either.
Is there a way to separate the string first, without the <>/ symbols?
You can use several alternative delimiters in the split regex, like this:
String line = "<F,Bird,20,10/> < A,Fish,5,11,2/>";
String[] lines = line.split("<|/> <|/>");
for (String item: lines) {
    System.out.println(item);
}
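Output (with all your spaces; note that the array also starts with an empty element because the line begins with a delimiter, and the second record keeps its leading space):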
F,Bird,20,10
A,Fish,5,11,2
Try splitting your input string using the lookbehind (?<=/>):
String input = "<F,Bird,20,10/> < A,Fish,5,11,2/>";
input = input.replaceAll("\\s+", "");
String[] parts = input.split("(?<=/>)");
for (String part : parts) {
    System.out.println(part.replaceAll("[<>/]", ""));
}
Note that I removed all spaces from your string to make splitting cleaner. We could still try to split with arbitrary whitespace present, but it would be more work. From this point, you can easily access the CSV data contained within each tag.
Output:
F,Bird,20,10
A,Fish,5,11,2
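Another option, sketched here under the assumption that every record has the form <.../> and never contains < or > itself, is to pull the records out with a Matcher instead of splitting, and then split each record on commas. The class name TaggedRecords is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TaggedRecords {
    // One record: "<", optional spaces, the payload (group 1), optional spaces, "/>"
    private static final Pattern RECORD = Pattern.compile("<\\s*([^<>]*?)\\s*/>");

    static List<List<String>> parse(String line) {
        List<List<String>> records = new ArrayList<>();
        Matcher matcher = RECORD.matcher(line);
        while (matcher.find()) {
            // group(1) is the text between the delimiters, e.g. "F,Bird,20,10"
            records.add(Arrays.asList(matcher.group(1).split(",")));
        }
        return records;
    }

    public static void main(String[] args) {
        String line = "<F,Bird,20,10/> < A,Fish,5,11,2/>";
        System.out.println(parse(line)); // prints [[F, Bird, 20, 10], [A, Fish, 5, 11, 2]]
    }
}
```

Because only the text between the delimiters is captured, no empty leading element and no stray space end up in the result.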

How to return only first n number of words in a sentence Java

Say I have a simple piece of text as below.
For example, this is what I have:
A simple sentence consists of only one clause. A compound sentence
consists of two or more independent clauses. A complex sentence has at
least one independent clause plus at least one dependent clause. A set
of words with no independent clause may be an incomplete sentence,
also called a sentence fragment.
I want only the first 10 words of the text above.
I'm trying to produce the following string:
A simple sentence consists of only one clause. A compound
I tried this:
bigString.split(" " ,10).toString()
But it returns the same bigString wrapped in [] like an array.
Thanks in advance.
Assume bigString is a String holding your text. The first thing to do is split the string into single words.
String[] words = bigString.split(" ");
How many words would you like to extract?
int n = 10;
Put the words together:
String newString = "";
// Math.min guards against texts shorter than n words; the space goes between words only
for (int i = 0; i < Math.min(n, words.length); i++) { newString = newString + (i == 0 ? "" : " ") + words[i]; }
System.out.println(newString);
Hope this is what you needed.
If you want to know more about regular expressions (i.e. to tell java where to split), see here: How to split a string in Java
If you call split with a limit (yours is 10), it does not give you the first 10 parts and stop; it gives you the first 9 parts, and the 10th element of the array contains the rest of the input String. Printing the array with Arrays.toString therefore shows the whole input again. To get what you initially wanted, split with a limit of 11 and drop the last element:
String[] myArray = bigString.split(" ", 11);
// Keep at most the first 10 parts; the 11th element (if present) holds the rest of the text.
String firstTen = String.join(" ", Arrays.copyOf(myArray, Math.min(10, myArray.length)));
This will help you:
String[] strings = Arrays.stream(bigString.split(" "))
        .limit(10)
        .toArray(String[]::new);
Here is exactly what you want:
String[] result = new String[10];
// regex \s matches a whitespace character: [ \t\n\x0B\f\r]
String[] raw = bigString.split("\\s", 11);
// the last entry of the raw array holds the rest of the text, so only the first 10 entries are copied (this assumes at least 10 words)
System.arraycopy(raw, 0, result , 0, 10);
System.out.println(Arrays.toString(result));
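Since the question asks for a String rather than an array, the pieces above can be combined into one small helper. This is only a sketch: the class and method name FirstWords.firstWords are made up here, and it assumes that words are separated by any run of whitespace (the sample text contains line breaks) and that n is not negative.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

final class FirstWords {
    // Returns the first n whitespace-separated words of text, joined by single spaces.
    // If the text has fewer than n words, all of them are returned.
    static String firstWords(String text, int n) {
        return Arrays.stream(text.trim().split("\\s+"))
                .limit(n)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String bigString = "A simple sentence consists of only one clause. A compound sentence\n"
                + "consists of two or more independent clauses.";
        System.out.println(firstWords(bigString, 10));
        // prints: A simple sentence consists of only one clause. A compound
    }
}
```

Joining with Collectors.joining avoids both the leading space of manual concatenation and the [ ] wrapper that Arrays.toString adds.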

Matching and sorting a Bukkit ChatColor expression

I'm splitting up a String by spaces and then checking each piece if it contains a code (&a, &l, etc). If it matches, I have to grab the codes that are beside each other and then order them alphanumerically (0, 1, 2... a, b, c...).
Here is what I tried so far:
String message = "&l&aCheckpoint&m&6doreime";
String[] parts = message.split(" "); // This may not be needed for the example, but I'm only using one word for simplicity here
List<String> orderedMessage = new ArrayList<>();
Pattern pattern = Pattern.compile("((?:&|\u00a7)[0-9A-FK-ORa-fk-or])(.*?)"); // Completely matches the entire pattern, not what i want
for (String part : parts) {
    if (pattern.matcher(part).matches()) {
        List<String> orderedParts = new ArrayList<>();
        // what do i do?
    }
}
I need to change the pattern value so it matches groups like this:
Match: &l&aCheckpoint
Groups that I need: [&l, &a, Checkpoint]
Match: &m&6doreime
Groups that I need: [&m, &6, doreime]
How can I match each Match shown above and split it into the 3 groups, where it splits out each code section (&[0-9A-FK-ORa-fk-or]) and then the remaining text up to the next code section?
Info: For anyone who is wondering why, when you submit color/format coded text to Minecraft, colors have to come first, or the format ([a-fk-or]) codes are ignored because of how Minecraft has parsed color codes since 1.5. By sorting them and rebuilding the message, it won't rely on users or developers getting the order correct.
You can get what you are after by using a slightly more complicated regex:
(((?:&|§)[0-9A-FK-ORa-fk-or])+)([^&]*)
Breaking it down, we have two important capturing groups:
(((?:&|§)[0-9A-FK-ORa-fk-or])+)
This will capture one or more code sections, each an & followed by a code character.
([^&]*)
The second grabs any number of non-& characters, which gets you the remainder of that section (it is group 3 in the full regex). This is slightly different behavior from the regex you provided, and things get more complicated if & is a legal character in the text. Note also that this group and the split below only look for &, so § codes would need [^&§] and (?=[&§]) instead.
Putting that regex to use with a Matcher, you can do the following:
String input = "&l&aCheckpoint&m&6doreime";
Pattern pattern = Pattern.compile("(((?:&|§)[0-9A-FK-ORa-fk-or])+)([^&]*)");
Matcher patternMatcher = pattern.matcher(input);
while(patternMatcher.find()){
    String[] codes = patternMatcher.group(1).split("(?=&)");
    String rest = patternMatcher.group(3);
}
Which will loop twice, giving you
codes = ["&l", "&a"]
rest = "Checkpoint"
on the first loop and the following on the second
codes = ["&m", "&6"]
rest = "doreime"
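To finish the job described in the question (sorting the codes and rebuilding the message), something along these lines might work. It is a sketch, not Bukkit API: the class ColorCodeSorter and its reorder method are invented here, the pattern is a two-group variant of the one above that also treats § as a code prefix, and the ordering simply compares the lower-cased code character, which puts digits before letters and a-f before k-o and r.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ColorCodeSorter {
    // Group 1: a run of codes (& or § plus one code character).
    // Group 2: the text that follows, up to the next & or §.
    private static final Pattern SECTION =
            Pattern.compile("((?:[&\u00a7][0-9A-FK-ORa-fk-or])+)([^&\u00a7]*)");

    static String reorder(String message) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = SECTION.matcher(message);
        int last = 0;
        while (matcher.find()) {
            // Copy any text between the previous section and this one unchanged.
            out.append(message, last, matcher.start());
            // Split "&l&a" into ["&l", "&a"] without consuming the prefix character.
            String[] codes = matcher.group(1).split("(?=[&\u00a7])");
            Arrays.sort(codes,
                    Comparator.comparing((String code) -> Character.toLowerCase(code.charAt(1))));
            for (String code : codes) {
                out.append(code);
            }
            out.append(matcher.group(2));
            last = matcher.end();
        }
        out.append(message, last, message.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reorder("&l&aCheckpoint&m&6doreime"));
        // prints &a&lCheckpoint&6&mdoreime
    }
}
```

Text that contains no codes, or a lone & that is not followed by a valid code character, is passed through unchanged.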

How do I fill a new array with split pieces from an existing one? (Java)

I'm trying to split paragraphs of information from an array into a new one which is broken into individual words. I know that I need to use the String[] split(String regex), but I can't get this to output right.
What am I doing wrong?
(assume that sentences[i] is the existing array)
String phrase = sentences[i];
String[] sentencesArray = phrase.split("");
System.out.println(sentencesArray[i]);
Thanks!
It might be just the console output going wrong. Try replacing the last line by
System.out.println(java.util.Arrays.toString(sentencesArray));
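The empty-string argument to phrase.split("") is suspect too. Try splitting on a word boundary instead (note that the result then also contains the spaces and punctuation between the words as separate elements):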
phrase.split("\\b");
You are using an empty expression for splitting; try phrase.split(" ") and work from there.
This does nothing useful:
String[] sentencesArray = phrase.split("");
You're splitting on the empty string, which returns an array of the individual characters in the string (before Java 8, preceded by an empty string).
It's hard to tell from your question/code what you're trying to do but if you want to split on words you need something like:
private static final Pattern SPC = Pattern.compile("\\s+");
.
.
String[] words = SPC.split(phrase);
The regex will split on one or more whitespace characters, which is probably what you want.
String[] sentencesArray = phrase.split("");
The regex on which the phrase is split is empty here. If you wish to split on a space character, use:
String[] sentencesArray = phrase.split(" ");
// ^ Give this space
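If the goal is to fill one new array with the words of every entry in sentences (my reading of the question, so treat it as an assumption), a sketch using streams could look like this. The class name WordSplitter and the method toWords are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class WordSplitter {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Splits every entry on runs of whitespace and collects all words into one array.
    static String[] toWords(String[] sentences) {
        return Arrays.stream(sentences)
                .flatMap(sentence -> WHITESPACE.splitAsStream(sentence.trim()))
                .filter(word -> !word.isEmpty()) // a blank entry would otherwise add ""
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] sentences = { "Hello big world", "  two  spaces here " };
        System.out.println(Arrays.toString(toWords(sentences)));
        // prints [Hello, big, world, two, spaces, here]
    }
}
```

Printing with Arrays.toString shows the whole array, whereas sentencesArray[i] in the question prints only a single element.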
