Consider the following command line: tfile -a -fn P2324_234.w07 -tc 8811
The regex to parse this: -\w+|\w+\s|\w+\.+\w+\s
The problem is when the file name has multiple dots, say: tfile -a -fn P23.24.23.4.w07 -tc 8811
Question: how to ensure the P23.24.23.4.w07 is parsed as one argument (as in P23.24.23.4.w07)?
Describe it!
For: P23.24.23.4.w07
use: \w+(?:\.\w+)+
Note that in Java you can also use possessive quantifiers and atomic groups:
\\w++(?>\\.\\w++)+
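As a minimal sketch of the possessive version in use (the class and method names here are made up for illustration), the pattern can be run over the whole command line with Matcher.find() to pull out the dotted file name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DottedNameDemo {
    // Possessive quantifiers (++) and the atomic group (?>...) stop the engine
    // from backtracking into characters it has already consumed.
    static final Pattern DOTTED = Pattern.compile("\\w++(?>\\.\\w++)+");

    // Returns the first token that contains at least one dot, or null if none is found.
    static String firstDottedToken(String commandLine) {
        Matcher m = DOTTED.matcher(commandLine);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDottedToken("tfile -a -fn P23.24.23.4.w07 -tc 8811"));
    }
}
```

This only finds the dotted argument; the other arguments would still need the alternatives from the original expression.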
Use a character class, e.g., /-fn [a-z0-9.]+ -tc/i. In English, that means "-fn, followed by one or more of characters between a-z, between 0-9, or a ., followed by -tc." If you want to capture that part, wrap that part in parentheses.
I have used this
-\w+|\w+\s|\S+.+\w+\s
Instead of 'word' characters, we can use 'non-space' characters. You have not specified any extra requirements, so I think this is fine.
Use a quantifier:
-\w+|\w+\s|(?:\w+\.+)+\w+\s
           ^^^      ^^
You can also simplify your expression to:
-?\w+\s?|(?:\w+\.+)+\w+\s
For doing this in java, all you need to do is split it along the spaces, no regex needed. The good ole String.split() should be able to handle it.
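A rough sketch of the split approach follows. It assumes the arguments never contain quoted spaces; ArgSplitDemo and splitArgs are illustrative names, not part of any library.

```java
import java.util.Arrays;
import java.util.List;

class ArgSplitDemo {
    // Splits on runs of whitespace; trim() avoids an empty leading token.
    static List<String> splitArgs(String commandLine) {
        return Arrays.asList(commandLine.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // The file name stays in one piece because it contains no whitespace.
        System.out.println(splitArgs("tfile -a -fn P23.24.23.4.w07 -tc 8811"));
    }
}
```

Strictly speaking, "\\s+" is itself a small regex, since String.split() takes one, but it does not need to know anything about dots in the file name.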
Related
I want to capture an alphanumeric group in regex such that it does not capture a leading underscore. For example, _reverse(abc) should return reverse(. I am using (?<name>\w+) but it returns _reverse(.
You can try this,
[^a-zA-Z0-9()\\s+]
The output will be reverse(abc)
You can specify characters explicitly, e.g.:
[a-zA-Z0-9]+
From what you are showing, I assume you want to strip underscores and content behind the opening parentheses.
Basically, that should work with a regex like this:
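"_([a-zA-Z0-9]+\\()"
This can be used in conjunction with a Matcher to extract the capturing group (in this case, [a-zA-Z0-9]+\() from each match and return it. Note that the backslash is doubled because the pattern is written as a Java string literal.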
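A minimal sketch of that Matcher loop might look like the following; the class and method names are hypothetical, and the pattern is the one suggested above written as a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupExtractDemo {
    // Group 1 is the name plus the opening parenthesis, without the leading underscore.
    static final Pattern NAME = Pattern.compile("_([a-zA-Z0-9]+\\()");

    // Collects group 1 from every match in the input.
    static List<String> extract(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = NAME.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("_reverse(abc)"));
    }
}
```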
Note that you can find almost all the help you need with regular expressions on utility sites like RegEx 101 and Regexper, the latter being a nice visualizer but only working with JavaScript-like expressions.
Also, RegEx 101 contains a Regex Debugger to help avoid dangerous regular expressions
I used regex101 to make my expression, and it looks like this using their symbols
\d+ [+-\/*] \d*
Basically I want a user to enter like 123 + 123 but the entire statement is one string with exactly one space after the first number and one space after the operator
The above expression works there, but it doesn't carry over to Java unchanged.
I thought these symbols were universal, but I guess not. Any ideas how to convert this to the proper syntax?
Regular expressions are not universal. In general, no two regular expression systems are the same.
Java does not have regular expressions built into the language; some Java classes support them. The Pattern class defines the regular expression syntax used by those classes, including Matcher, which seems likely to be the class you are using.
As already identified in the comments, \ is the escape character in Java string literals. If you want to represent \ in a String, you must write \\. For example, \d in a regular expression must be written \\d in a Java String.
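To make the doubling concrete, here is a small sketch for the "123 + 123" input. It deliberately uses a slightly tightened pattern rather than a literal translation, which is an assumption about the intent: the operator class is written [-+*/] with the dash first, because in the original [+-\/*] the sequence +-\/ is a range that also admits "," and ".", and the second number uses \d+ so it cannot be empty. The class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class EscapedDigitsDemo {
    // The regex  \d+ [-+*/] \d+  written as a Java string literal: each \ becomes \\.
    static final Pattern EXPR = Pattern.compile("\\d+ [-+*/] \\d+");

    // True when the whole input is: number, one space, operator, one space, number.
    static boolean isSimpleExpression(String input) {
        return EXPR.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSimpleExpression("123 + 123"));
        System.out.println(isSimpleExpression("123+123"));
    }
}
```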
You can simply use groups () and design a RegEx as you wish. This RegEx might be one way to do so:
((\d+\s)(\+|\-)(\s\d+))
It has four groups, and group 1 ($1) holds the entire match.
Remember to escape the backslashes as your language requires.
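A short sketch of reading those groups in Java, assuming the pattern above is used as-is (so only + and - are accepted as operators); OperandGroupsDemo and parts are made-up names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OperandGroupsDemo {
    // Group 1: whole expression, 2: left number plus space, 3: operator, 4: space plus right number.
    static final Pattern EXPR = Pattern.compile("((\\d+\\s)(\\+|\\-)(\\s\\d+))");

    // Returns {left, operator, right}, or null when the input does not match.
    static String[] parts(String input) {
        Matcher m = EXPR.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // Groups 2 and 4 include the separating space, so trim it off.
        return new String[] { m.group(2).trim(), m.group(3), m.group(4).trim() };
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", parts("123 + 123")));
    }
}
```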
I am trying to modify an existing Regex expression being pulled in from a properties file from a Java program that someone else built.
The current Regex expression used to match an email address is -
RR.emailRegex=^[a-zA-Z0-9_\\.]+#[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$
That matches email addresses such as abc.xyz#example.com, but now some email addresses have dashes in them such as abc-def.xyz#example.com and those are failing the Regex pattern match.
What would my new Regex expression be to add the dash to that regular expression match or is there a better way to represent that?
Based on the regex you are using, you can add the dash to your character classes, so that
RR.emailRegex=^[a-zA-Z0-9_\\.]+#[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$
becomes
RR.emailRegex=^[a-zA-Z0-9_\\.-]+#[a-zA-Z0-9_-]+\\.[a-zA-Z0-9_-]+$
Btw, you can shorten your regex like this:
RR.emailRegex=^[\\w.-]+#[\\w-]+\\.[\\w-]+$
Anyway, I would use the Apache Commons Validator EmailValidator instead, like this:
if (EmailValidator.getInstance().isValid(email)) ....
The meaning of - inside a character class is different from its meaning elsewhere. Inside a character class, - denotes a range, e.g. 0-9. If you want to include a literal -, write it at the beginning or end of the character class, like [-0-9] or [0-9-].
You also don't need to escape . inside a character class, because it is treated as a literal . there.
Your regex can be simplified further. \w denotes [A-Za-z0-9_]. So you can use
^[-\w.]+#[\w]+\.[\w]+$
In Java, this can be written as
^[-\\w.]+#[\\w]+\\.[\\w]+$
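The effect of adding the dash can be checked with a quick sketch like the one below. It keeps the # separator exactly as the patterns in this thread are written, and uses the shortened pattern with the dash placed at the end of each class; the class and constant names are illustrative.

```java
import java.util.regex.Pattern;

class DashInClassDemo {
    // The original pattern: no dash allowed anywhere.
    static final Pattern OLD =
            Pattern.compile("^[a-zA-Z0-9_\\.]+#[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$");

    // Dash added as the last character of each class, where it is a literal, not a range.
    static final Pattern NEW = Pattern.compile("^[\\w.-]+#[\\w-]+\\.[\\w-]+$");

    static boolean oldAccepts(String s) { return OLD.matcher(s).matches(); }
    static boolean newAccepts(String s) { return NEW.matcher(s).matches(); }

    public static void main(String[] args) {
        String dashed = "abc-def.xyz#example.com";
        System.out.println(oldAccepts(dashed) + " " + newAccepts(dashed));
    }
}
```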
^[a-zA-Z0-9_\\.\\-]+#[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$
Should solve your problem. Inside a character class, escaping - (or placing it first or last) keeps it from being read as a range; outside a class, characters such as ? and * must be escaped to match literally.
The correct Regex fix is below.
OLD Regex Expression
^[a-zA-Z0-9_\\.]+#[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$
NEW Regex Expression
^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$
Actually, I read a post that covers all the special cases, and the pattern from it that works correctly with Java is
String pattern ="(?:[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")#(?:(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?|\\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-zA-Z0-9-]*[a-zA-Z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])";
What is the best approach if, for instance, a literal question mark is expected in a String?
...[?]...
or
...\?...
Example:
The text bla?bla will match both the pattern bla[?]bla and bla\?bla (but not bla?bla, obviously), but is there any reason to use one over the other?
There is no technical reason to prefer one over the other: They are equivalent expressions. The character class is only used to avoid entering a backslash, so IMHO the escaped version is "cleaner"
However, the reason may be to avoid double-escaping the backslash on input. In languages like Java, the string literal for the escaped version would look like this:
// in java you need to escape a backslash with another backslash :(
String regex = "...\\?...";
It could be that wherever the regexes are coming from has a similar issue and it's easier to read [?] than \\?
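A quick sketch comparing the two spellings in Java (names are illustrative). It also shows Pattern.quote, a standard-library helper that sidesteps the question by quoting the literal text:

```java
import java.util.regex.Pattern;

class QuestionMarkDemo {
    static boolean viaClass(String s)  { return Pattern.matches("bla[?]bla", s); }
    static boolean viaEscape(String s) { return Pattern.matches("bla\\?bla", s); }

    // Unescaped: the ? makes the preceding 'a' optional instead of matching a literal '?'.
    static boolean unescaped(String s) { return Pattern.matches("bla?bla", s); }

    // Pattern.quote wraps the text so that every character is taken literally.
    static boolean viaQuote(String s) {
        return Pattern.matches("bla" + Pattern.quote("?") + "bla", s);
    }

    public static void main(String[] args) {
        String text = "bla?bla";
        System.out.println(viaClass(text) + " " + viaEscape(text) + " "
                + viaQuote(text) + " " + unescaped(text));
    }
}
```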
I'm looking for a reg expression which has the exact same meaning as the "*" operator in a linux / windows command line. For example, find all files that: starts with 0 or more random chars, contains "abc" in the middle, and ends with 0 or more random chars.
So something like this in Java:
if (test.match("*abc*"))
System.out.println("found match");
Original answer:
.*abc.*
is the regexp that solves your problem. Note that if you want to match newlines as part of your test string, you might need to enable single-line mode (Pattern.DOTALL in Java).
Revised answer if you are really talking about files:
[^/]*abc[^/]*
is a better answer since globs do not actually match directories in "*". For example, /etc/*bar will match /etc/foobar but will not match /etc/foo/bar. However, you said you were not interested in filenames, so the difference may be irrelevant to you.
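Both variants can be sketched in Java as follows. Pattern.DOTALL is Java's name for single-line mode; the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

class StarDemo {
    // DOTALL lets '.' match line terminators as well.
    static final Pattern ANYWHERE = Pattern.compile(".*abc.*", Pattern.DOTALL);

    // Glob-like variant: the wildcard parts may not cross a '/' boundary.
    static final Pattern NO_SLASH = Pattern.compile("[^/]*abc[^/]*");

    static boolean containsAbc(String s) { return ANYWHERE.matcher(s).matches(); }
    static boolean globLike(String s)    { return NO_SLASH.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(containsAbc("foo/abc") + " " + globLike("foo/abc"));
    }
}
```

Note that matches() requires the whole input to match, which is why the leading and trailing wildcards are needed.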
* in Unix is expressed as (.*) in regular expressions.
if (test.matches("(.*)abc(.*)")) { /* ... */ }
Sounds like you want to 'glob' more than a full regex. Check this page out http://download.oracle.com/javase/tutorial/essential/io/find.html
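If a real glob is what is wanted, java.nio.file offers PathMatcher with the "glob:" syntax. The sketch below assumes plain file names on the default file system; GlobDemo and matchesGlob are illustrative names.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class GlobDemo {
    // "glob:*abc*" uses shell-style wildcards; * does not cross directory boundaries.
    static final PathMatcher MATCHER =
            FileSystems.getDefault().getPathMatcher("glob:*abc*");

    // Tests a bare file name against the glob.
    static boolean matchesGlob(String fileName) {
        return MATCHER.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        System.out.println(matchesGlob("xabcx.txt"));
        System.out.println(matchesGlob("xyz.txt"));
    }
}
```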