Split string after n amount of digits occurrence - java

I'm parsing some folder names here. I have a program that lists subfolders of a folder and parses folder names.
For example, one folder could be named something like this:
"Folder.Name.1234.Some.Info.Here-ToBeParsed"
and I would like to parse it so that the name would be "Folder Name". At the moment I first use string.replaceAll() to get rid of special characters, which still leaves this 4-digit sequence in the middle. I would like to split the string at that point. How can I achieve this?
Currently my code looks something like this:
// Parsing string if regex p matches folder's name
if(b) {
//System.out.println("Folder: \" " + name + "\" contains special characters.");
String result = name.replaceAll("[\\p{P}\\p{S}]", " "); // Getting rid of all punctuations and symbols.
//System.out.println("Parsed: " + name + " > " + result);
// If string matches regex p2
if(b2) {
//System.out.println("Folder: \" " + result + "\" contains release year.");
String parsed_name[] = result.split("20"); // This is the line where I would like to split wherever 4 digits occur in a row.
//System.out.println("Parsed: " + result + " > " + parsed_name[0]);
movieNames.add(parsed_name[0]);
}
Or maybe there is an even easier way to do this? Thanks in advance!

You should keep it simple like this:
String name = "Folder.Name.1234.Some.Info.Here-ToBeParsed";
String repl = name.replaceFirst( "\\.\\d{4}.*", "" ).
replaceAll( "[\\p{P}\\p{S}&&[^']]+", " " );
//=> Folder Name
replaceFirst removes the first DOT followed by 4 digits, together with everything after it
replaceAll replaces each run of punctuation and symbol characters (except the apostrophe) with a single space
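If the digits themselves are also needed (for example as a release year), the same idea can be written with a capturing Pattern. This is only a sketch: FolderNameParser, title and digits are hypothetical names, and it assumes the four digits are delimited by dots as in the sample folder name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FolderNameParser {
    // Assumption: the title is everything before the first ".dddd" group
    // that is followed by another dot or by the end of the name.
    private static final Pattern TITLE_AND_DIGITS =
            Pattern.compile("^(.*?)\\.(\\d{4})(?:\\.|$)");

    // Returns the title with runs of punctuation and symbols
    // (except apostrophes) collapsed to single spaces.
    static String title(String folder) {
        Matcher m = TITLE_AND_DIGITS.matcher(folder);
        String raw = m.find() ? m.group(1) : folder;
        return raw.replaceAll("[\\p{P}\\p{S}&&[^']]+", " ").trim();
    }

    // Returns the four-digit group, or an empty string if there is none.
    static String digits(String folder) {
        Matcher m = TITLE_AND_DIGITS.matcher(folder);
        return m.find() ? m.group(2) : "";
    }

    public static void main(String[] args) {
        String folder = "Folder.Name.1234.Some.Info.Here-ToBeParsed";
        System.out.println(title(folder));
        System.out.println(digits(folder));
    }
}
```

A folder name without a dot-delimited 4-digit group falls back to the whole name with the punctuation replaced.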

Related

Regex: starts with messages and string between parent message curly brace

I want to get all the message data: the regex should find each top-level message and capture everything between the curly braces of that parent message. With the pattern below, I am not getting the whole parent body.
String data = "syntax = \"proto3\";\r\n" +
"package grpc;\r\n" +
"\r\n" +
"import \"envoyproxy/protoc-gen-validate/validate/validate.proto\";\r\n" +
"import \"google/api/annotations.proto\";\r\n" +
"import \"google/protobuf/wrappers.proto\";\r\n" +
"import \"protoc-gen-swagger/options/annotations.proto\";\r\n" +
"\r\n" +
"message Acc {\r\n" +
" message AccErr {\r\n" +
" enum Enum {\r\n" +
" UNKNOWN = 0;\r\n" +
" CASH = 1;\r\n" +
" }\r\n" +
" }\r\n" +
" string account_id = 1;\r\n" +
" string name = 3;\r\n" +
" string account_type = 4;\r\n" +
"}\r\n" +
"\r\n" +
"message Name {\r\n" +
" string firstname = 1;\r\n" +
" string lastname = 2;\r\n" +
"}";
List<String> allMessages = new ArrayList<>();
Pattern pattern = Pattern.compile("message[^\\}]*\\}");
Matcher matcher = pattern.matcher(data);
while (matcher.find()) {
String str = matcher.group();
allMessages.add(str);
System.out.println(str);
}
I am expecting a result like the one below, with my ArrayList of strings having size 2.
allMessages.get(0) should be:
message Acc {
message AccErr {
enum Enum {
UNKNOWN = 0;
CASH = 1;
}
}
string account_id = 1;
string name = 3;
string account_type = 4;
}
and allMessages.get(1) should be:
message Name {
string firstname = 1;
string lastname = 2;
}
First remove the input that precedes the first line starting with "message", then split on line breaks followed by "message" (the line breaks are part of the split pattern, so the blank lines between parent messages are consumed):
String[] messages = data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)");
See live demo.
If you actually need a List<String>, pass that result to Arrays.asList():
List<String> allMessages = Arrays.asList(data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)"));
The first regex matches everything from the start up to, but not including, the first line that starts with message, which is replaced with a blank (i.e. deleted). Breaking it down:
(?sm) turns on flags s, which makes dot also match newlines, and m, which makes ^ and $ match at the start and end of each line
\\A means the very start of input
.*? means any quantity of any character (including newlines, as the s flag is set); the ? makes the quantifier reluctant, so it matches as few characters as possible while still matching
(?=^message) is a look ahead and means the following characters are the start of a line, then "message"
See regex101 live demo for a thorough explanation.
The split regex matches one or more line break sequences when they are followed by "message":
\\R+ means one or more line break sequences (all OS variants)
(?=message) is a look ahead and means the following characters are "message"
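Put together as a runnable class, the two steps might look like the sketch below. TopLevelMessages and split are hypothetical names, and the approach assumes nested messages are indented, so that inside a parent message a line break is never directly followed by the word "message".

```java
import java.util.Arrays;
import java.util.List;

class TopLevelMessages {
    // Assumption: nested messages are indented, so only top-level
    // messages have "message" directly after a line break.
    static List<String> split(String proto) {
        // Step 1: drop everything before the first line that starts with "message".
        String body = proto.replaceAll("(?sm)\\A.*?(?=^message)", "");
        // Step 2: split on the line breaks that precede each top-level message.
        return Arrays.asList(body.split("\\R+(?=message)"));
    }

    public static void main(String[] args) {
        String proto = "syntax = \"proto3\";\r\n"
                + "\r\n"
                + "message Acc {\r\n"
                + "  message AccErr {\r\n"
                + "  }\r\n"
                + "  string account_id = 1;\r\n"
                + "}\r\n"
                + "\r\n"
                + "message Name {\r\n"
                + "  string firstname = 1;\r\n"
                + "}";
        List<String> messages = split(proto);
        System.out.println(messages.size());
        System.out.println(messages.get(1));
    }
}
```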
See regex101 live demo for a thorough explanation.
Try this for your regex. It anchors on message being the start of a line, and uses a positive lookahead to find the next message or the end of messages.
Pattern.compile("(?s)\r\n(message.*?)(?=(\r\n)+message|$)")
// or
Pattern.compile("(?s)\r?\n(message.*?)(?=(\r?\n)+message|$)")
No splitting, parsing, or managing nested braces either :)
https://regex101.com/r/Wa2xxx/1
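Both regex answers rely on layout: top-level messages start a line and nested ones are indented. As an alternative that is not taken from either answer, a small brace-depth scan does not depend on indentation. This is a minimal sketch, assuming braces never appear inside string literals or comments; BraceScanner, topLevelMessages and startsMessage are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class BraceScanner {
    // True if the standalone word "message" begins at index i.
    static boolean startsMessage(String s, int i) {
        boolean wordStart = i == 0 || Character.isWhitespace(s.charAt(i - 1));
        boolean wordEnd = i + 7 < s.length() && Character.isWhitespace(s.charAt(i + 7));
        return wordStart && wordEnd && s.startsWith("message", i);
    }

    // Collects every message block that opens while the brace depth is zero.
    static List<String> topLevelMessages(String proto) {
        List<String> result = new ArrayList<>();
        int depth = 0;   // number of currently open braces
        int start = -1;  // start index of the current top-level message, or -1
        for (int i = 0; i < proto.length(); i++) {
            char c = proto.charAt(i);
            if (depth == 0 && start < 0 && startsMessage(proto, i)) {
                start = i;
            }
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0 && start >= 0) {
                    result.add(proto.substring(start, i + 1));
                    start = -1;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String proto = "message A {\nmessage B {\n}\n}\nmessage C {\n}";
        System.out.println(topLevelMessages(proto));
    }
}
```

The trade-off is a few more lines of code in exchange for not caring how the file is indented.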

How to split a java string with a regex?

I am struggling a bit with splitting a string.
Here is an example of an input and the correct output I want:
Input: "Hope you're doing well! I am doing ok. " <--- A few spaces after the period
Output:
[Hope, " ", you're, " ", doing, well, "!" , " ", "I", " ",
"am", " ", "doing", " ", "ok", "." , " ", " ", " ", " ", " "]
I want an output that splits all the words into their own index (even if a word includes an apostrophe). Also, I want all the spaces and punctuation (?, !, ., " ") to have their own index in the array.
Here's what I've tried: I took a string message and used the split function. The regex I used gives me almost the correct output, but it doesn't account for the extra spaces after the period.
The regex I used:
"\\b |(?=\\p{Punct})|(?<=\\p{Punct}) | "
Anyone have any suggestions? Thank you for your time.
Here is a way to do it but it is rather unorthodox.
String str = "Hope you're doing well! I am doing ok. ";
Establish a regex for all the punctuation, spaces, etc you want using a capture group.
String regex = "([!\\s\\.])";
Then replace each occurrence of those surrounded by a non-relevant character. In this case I used a #. You could actually use several characters together as a split delimiter.
Then split on that character.
String[] tokens = str.replaceAll(regex, "#$1#").split("#");
System.out.println(Arrays.toString(tokens));
Prints
[Hope, , you're, , doing, , well, !, , , I, , am, , doing, , ok, ., , ]
You can get rid of the empty strings ("") as follows:
tokens = Arrays.stream(tokens).filter(s->!s.isEmpty()).toArray(String[]::new);
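The placeholder trick misbehaves if the input already contains a #, and it needs the extra filtering step. As a sketch of an alternative (not part of the answer above), a Matcher can collect the tokens directly, using the same delimiter set of !, whitespace and period. DelimiterTokenizer and tokenize are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterTokenizer {
    // A token is either a run of non-delimiter characters
    // or exactly one delimiter character.
    private static final Pattern TOKEN = Pattern.compile("[^!\\s.]+|[!\\s.]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hope you're doing well! I am doing ok.   "));
    }
}
```

Because every character of the input belongs to exactly one token, no empty strings are produced and repeated spaces each become their own element.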
First you need to define what constitutes the characters of a "word", and anything else should become a separate token. Here I've defined word-characters as letters, numbers, apostrophes, and dashes, so the separate tokens are anything but those:
[^\p{L}\p{N}'-]
Then you build a regex using zero-width positive lookahead and lookbehind for the non-word characters, with a little extra to make sure we don't do a zero-width match at the beginning or end of input.
(?<!^)(?=[^\p{L}\p{N}'-])|(?<=[^\p{L}\p{N}'-])(?!$)
As Java code, that would be:
String input = "Hope you're doing well! I am doing ok. ";
String regex = "(?<!^)(?=[^\\p{L}\\p{N}'-])|(?<=[^\\p{L}\\p{N}'-])(?!$)";
String[] tokens = input.split(regex);
System.out.println(Arrays.stream(tokens).map(s -> '"' + s + '"')
.collect(Collectors.joining(", ", "[", "]")));
Output
["Hope", " ", "you're", " ", "doing", " ", "well", "!", " ", "I", " ",
"am", " ", "doing", " ", "ok", ".", " ", " ", " ", " ", " "]

cannot split a specific kind of strings using Java

I am working in Java. I have a list of parameters stored in a string which comes from Excel. I want to split it only at the hyphen that starts each new line. Such a string is stored in every Excel cell, and I am trying to extract it using Apache POI. The format is as below:
String text =
"- I am string one\n" +
"-I am string two\n" +
"- I am string-three\n" +
"with new line\n" +
"-I am string-four\n" +
"- I am string five";
What I want
array or arraylist which looks like this
[I am string one,
I am string two,
I am string-three with new line,
I am string-four,
I am string five]
What I Tried
I tried to use split function like this:
String[] newline_split = text.split("-");
but the output I get is not what I want
My O/P
[, I am string one,
I am string two,
I am string, // wrong
three // wrong
with new line, // wrong
I am string, // wrong!
four, // wrong!
I am string five]
I might have to tweak split function a bit but not able to understand how, because there are so many hyphens and new lines in the string.
P.S.
If I try splitting only at new lines, then the item "- I am string-three \n with new line" breaks into two parts, which again is not correct.
EDIT:
Please know that this data inside string is incorrectly formatted just like what is shown above. It is coming from an excel file which I have received. I am trying to use apache poi to extract all the content out of each excel cell in a form of a string.
I intentionally tried to keep the format like what client gave me. For those who are confused about description inside A, I have changed it because I cannot post the contents on here as it is against privacy of my workplace.
You can
first remove line separators (replace them with a space) if they are not followed by - (at the start of the next line): .replaceAll("\\R(?!-)", " ") should do the trick
\R (written as "\\R" in a string literal) can be used since Java 8 to represent line separators
(?!...) is the negative look-ahead mechanism; it ensures that there is no - after the place where it is used (the - is not included in the match, so we will not remove a - that was checked by it)
then remove the - placed at the start of each line (let's also include any whitespace that follows it, to trim the start of the string). In other words, replace each - placed
after a line separator: can be represented by "\\R"
after the start of the string: can be represented by ^
This should do the trick: .replaceAll("(?<=\\R|^)-\\s*","")
finally split on the remaining line separators: .split("\\R")
Demo:
String text =
"- I am string one\n" +
"-I am string two\n" +
"- I am string-three\n" +
"with new line\n" +
"-I am string-four\n" +
"- I am string five";
String[] split = text.replaceAll("\\R(?!-)", " ")
.replaceAll("(?<=\\R|^)-\\s*","")
.split("\\R");
for (String s: split){
System.out.println("'"+s+"'");
}
Output (surrounded with ' to show start and end of results):
'I am string one'
'I am string two'
'I am string-three with new line'
'I am string-four'
'I am string five'
This is how I would do it:
import java.util.*;
public class MyClass {
public static void main(String args[]) {
String A = "- I am string one \n" +
" -I am string two\n" +
" - I am string-three \n" +
" with new line\n" +
" -I am string-four\n" +
"- I am string five";
String[] s2 = A.split("\r?\n");
List<String> lines = new ArrayList<String>();
String line = "";
for (int i = 0; i < s2.length; i++) {
String ss = s2[i].trim();
if (i == 0) { // first line MUST start with "-"
line = ss.substring(1).trim();
} else if (ss.startsWith("-")) {
lines.add(line);
ss = ss.substring(1).trim();
line = ss;
} else {
line = line + " " + ss;
}
}
lines.add(line);
System.out.println(lines.toString());
}
}
I hope it helps.
A little explanation:
I will process line by line, trimming each one.
If it starts with '-' it means the end of the previous line, so I include it in the list. If not, I concatenate with the previous line.
It looks as if you are splitting at the FIRST - of each line, so you need to remove every - that directly follows a newline:
str = str.replace("\n-", "\n");
then remove the initial "-":
str = str.substring(1);
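The idea of treating only a line-leading hyphen as the separator can also be sketched as a single split followed by joining the continuation lines. HyphenItems and parse are hypothetical names; the sketch assumes an item never has a continuation line that itself begins with a hyphen, and it tolerates spaces or tabs before the leading hyphen.

```java
import java.util.ArrayList;
import java.util.List;

class HyphenItems {
    static List<String> parse(String text) {
        List<String> items = new ArrayList<>();
        // (?m) lets ^ match at the start of every line, so only a hyphen that
        // begins a line (after optional spaces or tabs) acts as a separator.
        for (String part : text.split("(?m)^[ \\t]*-")) {
            // Join continuation lines with a single space and trim both ends.
            String item = part.replaceAll("\\s*\\R\\s*", " ").trim();
            if (!item.isEmpty()) {
                items.add(item);
            }
        }
        return items;
    }

    public static void main(String[] args) {
        String text =
                "- I am string one\n" +
                "-I am string two\n" +
                "- I am string-three\n" +
                "with new line\n" +
                "-I am string-four\n" +
                "- I am string five";
        for (String item : parse(text)) {
            System.out.println("'" + item + "'");
        }
    }
}
```

The emptiness check discards the empty leading element that split produces because the text starts with a separator.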

Unwanted elements appearing when splitting a string with multiple separators in Java

I have a string from which I need to remove all mentioned punctuations and spaces. My code looks as follows:
String s = "s[film] fever(normal) curse;";
String[] spart = s.split("[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s+]");
System.out.println("spart[0]: " + spart[0]);
System.out.println("spart[1]: " + spart[1]);
System.out.println("spart[2]: " + spart[2]);
System.out.println("spart[3]: " + spart[3]);
System.out.println("spart[4]: " + spart[4]);
But, I am getting some elements which are blank. The output is:
spart[0]: s
spart[1]: film
spart[2]:
spart[3]: fever
spart[4]: normal
My desired output is:
spart[0]: s
spart[1]: film
spart[2]: fever
spart[3]: normal
spart[4]: curse
Try with this:
public static void main(String[] args) {
String s = "s[film] fever(normal) curse;";
String[] spart = s.split("[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s]+");
for (String string : spart) {
System.out.println("'"+string+"'");
}
}
output:
's'
'film'
'fever'
'normal'
'curse'
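I believe it is because the + sits inside the character class, where it is a literal plus sign rather than a quantifier, so each separator character is matched on its own and two adjacent separators leave an empty string between them. Alternatively, replace every non-word character with a space and split on runs of spaces:
String[] spart = s.replaceAll( "\\W", " " ).split(" +");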
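A compact variant of the same fix, as a sketch: it swaps the hand-written separator list for the predefined class \p{Punct} plus whitespace (an assumption about which separators are wanted), puts the + outside the class, and filters out the empty first element that split still returns when the input begins with a separator. SeparatorSplit and words are hypothetical names.

```java
import java.util.Arrays;

class SeparatorSplit {
    static String[] words(String s) {
        // The + is outside the brackets, so a whole run of separators
        // counts as one delimiter.
        return Arrays.stream(s.split("[\\p{Punct}\\s]+"))
                .filter(w -> !w.isEmpty())  // drops a possible empty first element
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("s[film] fever(normal) curse;")));
    }
}
```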

How to remove commas at the end of any string

I have Strings "a,b,c,d,,,,, ", ",,,,a,,,,"
I want these strings to be converted into "a,b,c,d" and ",,,,a" respectively.
I am writing a regular expression for this. My java code looks like this
public class TestRegx{
public static void main(String[] arg){
String text = ",,,a,,,";
System.out.println("Before " +text);
text = text.replaceAll("[^a-zA-Z0-9]","");
System.out.println("After " +text);
}}
But this removes all the commas.
How can I write this to achieve the result given above?
Use :
text.replaceAll(",*$", "")
As mentioned by @Jonny in the comments, you can also use:
text.replaceAll(",+$", "")
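A quick check, as a sketch with hypothetical names (TrailingCommas, stripCommas, stripCommasAndSpaces): the comma-only pattern handles the second sample string, but it leaves the first one untouched because that string ends in a space. Adding the space to a character class covers both.

```java
class TrailingCommas {
    // Removes commas at the very end of the string only.
    static String stripCommas(String text) {
        return text.replaceAll(",+$", "");
    }

    // Removes any trailing mix of commas and spaces.
    static String stripCommasAndSpaces(String text) {
        return text.replaceFirst("[, ]+$", "");
    }

    public static void main(String[] args) {
        String[] samples = { "a,b,c,d,,,,, ", ",,,,a,,,," };
        for (String sample : samples) {
            System.out.println("\"" + stripCommas(sample) + "\" vs \""
                    + stripCommasAndSpaces(sample) + "\"");
        }
    }
}
```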
Your first example had a space at the end, so it needs to match [, ]. When using the same regular expression multiple times, it's better to compile it up front, and it only needs to replace once, and only if at least one character will be removed (+).
Simple version:
text = text.replaceFirst("[, ]+$", "");
Full code to test both inputs:
String[] texts = { "a,b,c,d,,,,, ", ",,,,a,,,," };
Pattern p = Pattern.compile("[, ]+$");
for (String text : texts) {
String text2 = p.matcher(text).replaceFirst("");
System.out.println("Before \"" + text + "\"");
System.out.println("After \"" + text2 + "\"");
}
Output
Before "a,b,c,d,,,,, "
After "a,b,c,d"
Before ",,,,a,,,,"
After ",,,,a"
