Regular expression to match a line and extract file name in Java - java

I have a String in the following format:
Index: /aap/guru/asdte/atsAPI.tcl
===================================================================
RCS file: /autons/atsAPI.tcl,v
retrieving revision 1.41
Index: /aap/guru/asdte/atsAPI1.tcl
===================================================================
RCS file: /autons/atsAPI1.tcl,v
retrieving revision 1.41
I want to match each line that starts with Index: and then get the file name from the path.
That is, first match Index: /aap/guru/asdte/atsAPI.tcl and then extract atsAPI.tcl as the final result.
Currently I match twice: first the whole line, then the file name within it.
My question is how to do this with a single regular expression in Java.
My current code is:
String line = "Index: /aap/guru/asdte/atsAPI.tcl\r\n===================================================================\r\nRCS file: /autons/atsAPI.tcl,v\r\nretrieving revision 1.41\r\n\r\nIndex: /aap/guru/asdte/atsAPI1.tcl\r\n===================================================================\r\nRCS file: /autons/atsAPI1.tcl,v\r\nretrieving revision 1.41";
Pattern regex1 = Pattern.compile("Index:.*?\\n", Pattern.DOTALL);
Pattern regex2 = Pattern.compile("[^*/]+$");
Matcher matcher1 = regex1.matcher(line);
while (matcher1.find()) {
    String s = matcher1.group(0);
    Matcher matcher2 = regex2.matcher(s);
    while (matcher2.find()) {
        System.out.println(matcher2.group(0));
    }
}

how to do it in a single regular expression in java.
Use a capturing group as shown below.
Regular Expression:
^Index:.*\/(.*)
The file name is captured by the last part of the regex, (.*), and can be read with matcher.group(1).
^ anchors the match at the start of a line
Index: matches the literal as-is
.* matches anything (greedy), so it runs up to the last slash on the line
\/ matches a slash /
(.*) matches the file name in a capturing group
Make sure the (?m) or Pattern.MULTILINE flag is set so that ^ matches at the start of every line, not only at the start of the input.
EDIT: Modify your code to use only one regex, like this:
Pattern pattern = Pattern.compile("^Index:.*\\/(.*)", Pattern.MULTILINE);
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
// Output:
atsAPI.tcl
atsAPI1.tcl

Try ^Index.+\/([^\.]+\.\w+)$ with multiline mode enabled (the m flag, which is Pattern.MULTILINE in Java), or Index.+\/([^\.]+\.\w+) without it. The only capturing group holds the name of the file.
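A minimal Java sketch of the anchored variant. The class and method names are hypothetical and only serve the illustration. Java has no g flag, so looping over Matcher.find() plays that role.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the question's code.
final class IndexFileNames {
    // MULTILINE makes ^ and $ match at every line boundary.
    private static final Pattern INDEX_LINE =
            Pattern.compile("^Index.+\\/([^\\.]+\\.\\w+)$", Pattern.MULTILINE);

    static List<String> extract(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = INDEX_LINE.matcher(text);
        while (m.find()) {          // repeated find() replaces the g flag
            names.add(m.group(1));  // group 1 holds the file name
        }
        return names;
    }

    public static void main(String[] args) {
        String text = "Index: /aap/guru/asdte/atsAPI.tcl\r\n"
                + "RCS file: /autons/atsAPI.tcl,v\r\n"
                + "Index: /aap/guru/asdte/atsAPI1.tcl\r\n"
                + "retrieving revision 1.41";
        System.out.println(extract(text));
    }
}
```

Because the capturing group requires a dot followed by word characters, a path whose last segment has no extension would not match at all.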

Try the following regex; the file name is in the first capturing group:
Index:.*?\/([\w]+\.[\w]*)

Related

Ignore creating beginnings of words in a regular expression

I'm trying to parse all the links in a message.
My Java code looks like this:
Pattern URLPATTERN = Pattern.compile(
"([--:\\w?#%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&](?:\\w+)=(?:\\w+))+|[--:\\w?#%&+~#=]+)?",
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
Matcher matcher = Patterns.URLPATTERN.matcher(message);
ArrayList<int[]> links = new ArrayList<>();
while (matcher.find())
    links.add(new int[] {matcher.start(1), matcher.end()});
[...]
The problem is that the links sometimes start with a colour code that looks like this: [&§]{1}[a-z0-9]{1}
An example: Please use Google: §ehttps://google.com, and don't ask me.
The regular expression above, which I found somewhere on the internet, matches ehttps://google.com, but it should match only https://google.com.
How can I change the regular expression to exclude the following pattern but still match the link that follows right after the colour code?
[&§]{1}[a-z0-9]{1}
You can add a (?:[&§][a-z0-9])? pattern (matching an optional sequence of a & or § and then an ASCII letter or digit) at the beginning of your regex:
Pattern URLPATTERN = Pattern.compile(
"(?:[&§][a-z0-9])?([--:\\w?#%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?#%&+~#=]+)?", Pattern.CASE_INSENSITIVE);
When the regex finds §ehttps://google.com, the §e is matched with the optional non-capturing group (?:[&§][a-z0-9])?, that is why it is "excluded" from the Group 1 value.
There is no need to use Pattern.MULTILINE | Pattern.DOTALL with your regex: the pattern contains no unescaped . and no ^ or $.
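A small self-contained sketch of how the optional prefix keeps the colour code out of the reported link. The class and method names are hypothetical, and § is written as \u00A7 only so that the example does not depend on the source file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern from the answer above.
final class ColourCodeLinks {
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(?:[&\u00A7][a-z0-9])?"                        // optional colour code, not captured
            + "([--:\\w?#%&+~#=]*\\.[a-z]{2,4}/{0,2})"      // group 1: scheme and host
            + "((?:[?&]\\w+=\\w+)+|[--:\\w?#%&+~#=]+)?",    // group 2: optional path or query
            Pattern.CASE_INSENSITIVE);

    // Returns the first link without its colour code, or null if there is none.
    static String firstLink(String message) {
        Matcher m = URL_PATTERN.matcher(message);
        if (!m.find()) {
            return null;
        }
        // start(1) skips the colour code; end() covers the whole match.
        return message.substring(m.start(1), m.end());
    }

    public static void main(String[] args) {
        String message = "Please use Google: \u00A7ehttps://google.com, and don't ask me.";
        System.out.println(firstLink(message));
    }
}
```

Using matcher.start(1) rather than matcher.start() is what drops the prefix, which is the same pair of offsets the question's loop already stores.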

How to capture multiple groups in regex?

I am trying to capture the following word and number:
stxt:usa,city:14
I can capture usa and 14 using:
stxt:(.*?),city:(\d.*)$
However, when the text is:
stxt:usa
the regex does not work. I tried to add an "or" condition using |, but that did not work either.
stxt:(.*?),|city:(\d.*)$
You may use
(stxt|city):([^,]+)
Pattern details:
(stxt|city) - either stxt or city (Group 1); you may add \b before the ( to match only a whole word
: - a colon
([^,]+) - 1 or more characters other than a comma (Group 2).
Java demo:
String s = "stxt:usa,city:14";
Pattern pattern = Pattern.compile("(stxt|city):([^,]+)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
}
Looking at your string, you could also find the word/digits after the colon.
:(\w+)
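As a sketch of how the (stxt|city):([^,]+) approach copes with a missing field, the matches can be collected into a map keyed by group 1. The class and method names below are made up for illustration, and the \b is the optional word boundary mentioned in the pattern details.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the answers above.
final class KeyValueFields {
    private static final Pattern FIELD = Pattern.compile("\\b(stxt|city):([^,]+)");

    // Returns the fields that are present; absent keys are simply missing.
    static Map<String, String> parse(String input) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            fields.put(m.group(1), m.group(2)); // key from group 1, value from group 2
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("stxt:usa,city:14"));
        System.out.println(parse("stxt:usa"));
    }
}
```

Each field is matched independently, so the input does not need to contain both keys or list them in a fixed order.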

Substring between lines using Regular Expression Java

I have the following string:
abc test ...
interface
somedata ...
xxx ...
!
sdfff as ##
example
yyy sdd ## .
!
I want to find the content between a line containing the word "interface" or "example" and a line "!".
The required output is something like this:
String[] output= {"somedata ...\nxxx ...\n","yyy sdd ## .\n"} ;
I can do this manually using substring and iteration, but I want to achieve it with a regular expression.
Is it possible?
This is what I have tried
String sample="abc\ninterface\nsomedata\nxxx\n!\nsdfff\ninterface\nyyy\n!\n";
Pattern pattern = Pattern.compile("(?m)\ninterface(.*?)\n!\n");
Matcher m =pattern.matcher(sample);
while (m.find()) {
    System.out.println(m.group());
}
Am I right? Please suggest the right way of doing it.
Edit:
A small change: I want to find the content between a line "interface" or "example" and a line "!".
Can this be achieved with a regex too?
You could use the (?s) DOTALL modifier.
String sample = "abc\ninterface\nsomedata\nxxx\n!\nsdfff\ninterface\nyyy\n!\n";
Pattern pattern = Pattern.compile("(?s)(?<=\\ninterface\\n).*?(?=\\n!\\n)");
Matcher m = pattern.matcher(sample);
while (m.find()) {
    System.out.println(m.group());
}
Output:
somedata
xxx
yyy
Note that the input in your example is different.
(?<=\\ninterface\\n) Asserts that the match must be preceded by the characters which are matched by the pattern present inside the positive lookbehind.
(?=\\n!\\n) Asserts that the match must be followed by the characters which are matched by the pattern present inside the positive lookahead.
Update:
Pattern pattern = Pattern.compile("(?s)(?<=\\n(?:example|interface)\\n).*?(?=\\n!\\n)");
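A runnable sketch of the updated pattern that collects the sections into a list. The class and method names are hypothetical. It assumes the marker line consists of exactly interface or example and is preceded by a newline, so a marker on the very first line of the input would be missed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the lookaround pattern from the update.
final class SectionBodies {
    private static final Pattern BODY = Pattern.compile(
            "(?s)(?<=\\n(?:example|interface)\\n).*?(?=\\n!\\n)");

    static List<String> extract(String text) {
        List<String> bodies = new ArrayList<>();
        Matcher m = BODY.matcher(text);
        while (m.find()) {
            bodies.add(m.group()); // the lookarounds keep the marker lines out of the match
        }
        return bodies;
    }

    public static void main(String[] args) {
        String sample = "abc\ninterface\nsomedata\nxxx\n!\nsdfff\nexample\nyyy\n!\n";
        for (String body : extract(sample)) {
            System.out.println("[" + body + "]");
        }
    }
}
```

The extracted text does not include the newline before the "!" line, which differs slightly from the trailing \n shown in the question's required output.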

How to extract date section from filename?

I need to find a regex to extract date section from the name of several files.
In particular I have these two formats:
ATC0200720140828080610.xls
ATC0200720140901080346_UFF_ACC.xls
I use these two regexes to check the file name format:
^ATC02007[0-9]{14}.xls$
^ATC02007[0-9]{14}_UFF_ACC.xls$
But I need a regex to extract a specific section:
constant | yyyyMMddHHmmss | constant
   ^     |       ^        |    ^
ATC02007 | 20140901080346 | _UFF_ACC.xls
Both regexes I'm using match the entire file name, so I can't use them to extract the middle section. Which is the right expression?
You are almost there. Just wrap the digits you want in parentheses to capture them.
^ATC02007([0-9]{14})(_UFF_ACC)?\.xls$
The digits are captured in group 1 ($1).
You need to use capturing groups.
^(ATC02007)([0-9]{14})((?:[^.]*)?\.xls)$
Group 1 contains the first constant, group 2 contains the date and time, and group 3 contains the trailing constant.
String s = "ATC0200720140828080610.xls\n" +
"ATC0200720140901080346_UFF_ACC.xls";
Pattern regex = Pattern.compile("(?m)^(ATC02007)([0-9]{14})((?:[^.]*)?\\.xls)$");
Matcher matcher = regex.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3));
}
Output:
ATC02007
20140828080610
.xls
ATC02007
20140901080346
_UFF_ACC.xls
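Once the 14 digits are captured, they can be turned into a date value. The following is a sketch only: the class and method names are hypothetical, it assumes Java 8 or later for java.time, and it assumes the digits always follow the yyyyMMddHHmmss layout given in the question.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the two file name formats come from the question.
final class FileNameTimestamp {
    private static final Pattern NAME =
            Pattern.compile("^ATC02007([0-9]{14})(?:_UFF_ACC)?\\.xls$");
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    static LocalDateTime parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        // Group 1 is the date section; parse() throws if it is not a valid timestamp.
        return LocalDateTime.parse(m.group(1), STAMP);
    }

    public static void main(String[] args) {
        System.out.println(parse("ATC0200720140828080610.xls"));
        System.out.println(parse("ATC0200720140901080346_UFF_ACC.xls"));
    }
}
```

The suffix is wrapped in a non-capturing group here because only the date section is needed.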

how to get all names and date of births from a specific file using java

Below is my text file:
welcome to java training
program
Name rtrti*&*
John
address india say^%$7
Date of Birth
11/12/1989
I have 100 files like the one above. The text was extracted from image files, so it is not in a fixed order. From this I need to get the names and dates of birth. Can you please suggest how to do this? I am new to this task.
Required output
John
11/12/1989
I have tried
Pattern p = Pattern.compile("Name");
Matcher matcher = p.matcher(content);
matcher.find();
But I have no idea how to get the line that follows the matched pattern. I cannot read the file line by line, because I need to store the entire text in a single string.
I'll give a few hints that will get you on track. Without more details regarding the expected input, it will be difficult to give you a solid solution. First, I trust that you are already familiar with the Pattern and Matcher javadocs. You will need to understand the Groups and capturing section. Finally, you can utilize DOTALL mode which will allow the . character to match newlines.
To get you started, the following should work to find the name:
Pattern p = Pattern.compile(
    "(?s)" +    // DOTALL
    ".*" +      // Match anything (to consume everything before 'Name')
    "Name" +    // Match the literal 'Name'
    ".*?" +     // Reluctantly grab everything until...
    "\n" +      // Newline is reached
    "\\s*" +    // Consume leading whitespace
    "(\\S+)"    // Capture at least one non-whitespace character
);
Matcher m = p.matcher(content);
if (m.find()) {
    String name = m.group(1); // The first capturing group contains "John"
}
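The date of birth could be found along the same lines. This is only a sketch: the class and method names are hypothetical, and it assumes the date is written as digits separated by slashes and follows the "Date of Birth" label with nothing but whitespace in between, which noisy extracted text may not guarantee.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the label and date layout are taken from the sample file.
final class BirthDateFinder {
    private static final Pattern DOB = Pattern.compile(
            "Date of Birth" +                   // the literal label
            "\\s*" +                            // whitespace, including the line break
            "(\\d{1,2}/\\d{1,2}/\\d{4})");      // capture a date such as 11/12/1989

    // Returns the date text, or null when the label or date is not found.
    static String find(String content) {
        Matcher m = DOB.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String content = "welcome to java training\nprogram\nName rtrti*&*\nJohn\n"
                + "address india say^%$7\nDate of Birth\n11/12/1989";
        System.out.println(find(content));
    }
}
```

No DOTALL flag is needed in this case because \s already matches line breaks.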
