I'm attempting to match anything between different delimiters, using a Java-based regex engine/interpreter. The text I'm after is the server.domain.com. I do not believe I can use any Java either, only a regular expression. The log will only have one OR the other, never both. This must be accomplished with a single regex, or the application will have to be rewritten.
Examples of the logs:
Host = server.domain.com|
OR
Host="server.domain.com"
Thus far I've tried the following, along with a number of other combinations...
Host="(.*?)"|Host\s=\s(.*?)\|
I must also use Host as part of the delimiter, as it is parsing out of a log with many other similar pieces.
Thanks for any help on this!
For the example given, you could use:
^Host\s*=\s*(?:")?(?:[^|"])+[|"]$
it will also accept
host=server.domain.com"
but if the logs are either / or that shouldn't be an issue.
Try this one with both strings:
String str1 = "Host = server.domain.com|";
String str2 = "Host=\"server.domain.com\"";
//Host(no or one space)=(" or one space)server.domain.com(| or ")
Pattern p = Pattern.compile("Host\\s?=[\\\"|\\s]server.domain.com[\\||\\\"]");
Matcher m = p.matcher(str1);
if (m.find()) {
System.out.println("found");
}
You can also try this one if the number of spaces on either side of the equals sign is not known.
//Host(zero or more spaces)=(zero or more spaces)(" or spaces)server.domain.com(" or |)
Pattern p = Pattern.compile("Host\\s*=\\s*[\\\"|\\s*]server.domain.com[\\\"|\\|]");
Thanks to aliteral mind, I learned about non capture groups and that was key...
Behold!...
Host(?:\s=\s|=\")(.*?)(?:\||\")
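For anyone who is able to call Java directly, the pattern above can be sanity-checked with a minimal sketch like the one below. The class and method names (HostExtractor, extractHost) are made up for illustration; only the regex itself comes from the answer, with the backslashes doubled for a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostExtractor {
    // Same regex as above, escaped for a Java string literal.
    // The two (?:...) groups are non-capturing, so the host is always group 1.
    private static final Pattern HOST =
            Pattern.compile("Host(?:\\s=\\s|=\")(.*?)(?:\\||\")");

    // Returns the captured host name, or null when the line has no Host field.
    static String extractHost(String line) {
        Matcher m = HOST.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractHost("Host = server.domain.com|"));
        System.out.println(extractHost("Host=\"server.domain.com\""));
    }
}
```

Because neither delimiter group captures, the same group index works for both log formats, which is what a single-regex setup needs.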
I am using Java to implement PDF to plain text conversion. Right now I am facing the problem of filtering out ID expressions from String representation of the text.
The idea here is to capture IDs as whole words longer than 4 characters and remove them. IDs must consist of both letters and numbers at the same time, in any order. They can have optional special symbols like :.- and are generally all uppercase, except for several cases where there might be one and (for now) exactly one lowercase letter in them. IDs can be encountered at any place in the sentence, and there are multiple sentences inside the String. I am also trying to capture the preceding space (if there is one) so there is no double space after I remove the ID. It is acceptable to split the expression into several pieces if it gets too complex.
I've created a small test snippet to show exactly what needs and doesn't need to be caught by the regular expression, as well as display my progress so far. I am using standard java.util.regex package for implementation.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[a-z]{0,1}[\\d\\.]*";
//"[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[[a-z]{0,1}[\\d\\.]+]*" //for comma removal
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("*");
System.out.println(testString);
It may be necessary to remove IDs together with their commas, so it would be great if the revised expression was capable of capturing commas or omitting them via minor alterations like the alternative regex I've provided.
My current solution filters out everything that needs to be filtered but also most of the things it shouldn't. It appears the rule that there must be at least one capital letter and one digit in the word isn't working, possibly because I need to use Lookahead/Lookbehind/Grouping, sadly none of which I managed to get to work properly. I also suspect the use of [] is completely incorrect in my example, but this is the only way I managed to get it to (mostly) work for now. Please help me.
My colleague and I were able to solve this issue in an elegant way. Below is a snippet from my current solution. I hope one day this proves useful to someone.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "(?i)(?=[\\dA-Z\\(\\)\\.:-]*\\d)(?=[\\dA-Z\\(\\)\\.:-]*[A-Z])[\\dA-Z\\(\\)\\.:-]{5,}";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("");
System.out.println(testString);
//Clean-up extra spaces and unneeded commas
//testString = testString.replaceAll("\\s{2,}", " ").replaceAll("(\\s\\.)|(\\s\\,)", "");
testString = testString.replaceAll("[ ]{2,}", " ").replaceAll("([ ]\\.)|([ ]\\,)", "");
System.out.println(testString);
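To see why the two lookaheads matter, here is a small sketch that applies the same expression to single tokens. The names IdCheck and looksLikeId are hypothetical, and checking one token at a time with matches() is an assumption made for illustration; the solution above runs the pattern over the whole text instead.

```java
import java.util.regex.Pattern;

class IdCheck {
    // Characters allowed inside an ID: digits, letters, parentheses, dot, colon, hyphen.
    private static final String ALLOWED = "[\\dA-Z().:-]";

    private static final Pattern ID = Pattern.compile(
            "(?i)"                            // case-insensitive, so [A-Z] also covers a-z
            + "(?=" + ALLOWED + "*\\d)"       // lookahead: a digit occurs somewhere in the run
            + "(?=" + ALLOWED + "*[A-Z])"     // lookahead: a letter occurs somewhere in the run
            + ALLOWED + "{5,}");              // the run itself, five or more characters

    // True when the whole token qualifies as an ID under the rules above.
    static boolean looksLikeId(String token) {
        return ID.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = {"ACTDIK002", "(L1:3.CI)", "1A-CS-01", "(DLCS)", "781-338-3000", "K-12"};
        for (String token : tokens) {
            System.out.println(token + " -> " + looksLikeId(token));
        }
    }
}
```

Each lookahead starts at the same position and only inspects the text, so both conditions (a digit and a letter) must hold before the final character class consumes anything.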
I'm trying to extract a string from a String in Regex Java
Pattern pattern = Pattern.compile("((.|\\n)*).{4}InsurerId>\\S*.{5}InsurerId>((.|\\n)*)");
Matcher matcher = pattern.matcher(abc);
I'm trying to extract the value between
<_1:InsurerId>F2021633_V1</_1:InsurerId>
I'm not sure where I am going wrong, but I don't get any output for
if (matcher.find())
{
System.out.println(matcher.group(1));
}
You can use:
Pattern pattern = Pattern.compile("<([^:]+:InsurerId)>([^<]*)</\\1>");
Matcher matcher = pattern.matcher(abc);
if (matcher.find()) {
System.out.println(matcher.group(2));
}
You may want to use the totally awesome page http://regex101.com/ to test your regular expressions. As you can see at https://regex101.com/r/rV8uM3/1, you only have empty capturing groups, but let me explain to you what you did. :D
((.|\n)*) This matches any character, or a new line, any number of times. It is capturing, so your first matching group will always be everything before <_1:InsurerId>, or an empty string. You could use .* instead, but note that in Java the dot does not match line terminators unless DOTALL mode is enabled (Pattern.DOTALL or (?s)). You can even leave it out, as it isn't actually part of the String you want to match - using anything here will actually be a problem if you have multiple InsurerIds in your file and want to get them all.
.{4}InsurerId> This matches "InsurerId>" with any four characters in front of it and is exactly what you want. As the first character is probably always an opening angle bracket (and you don't want stuff like "<ExampleInsurerId>"), I'd suggest using <.{3}InsurerId> instead. This still could have some problems (<Test id="<" xInsurerId>), so if you know exactly that it's "_<a digit>:", why not use <_\d:InsurerId>?
\S* matches everything except for whitespaces - probably not the best idea as XML and similar files can be written to not contain any space at all. You want to have everything to the next tag, so use [^<]* - this matches everything except for an opening angle bracket. You also want to get this value later, so you have to use a capturing group: ([^<]*)
.{5}InsurerId> The same thing here: use <\/.{3}InsurerId> or <\/_\d:InsurerId> (forward slashes are actually characters interpreted by other RegEx implementations, so I suggest escaping them)
((.|\n)*) Again the same thing, just leave it away
The resulting Regular Expression would then be the following:
<_\d:InsurerId>([^<]*)<\/_\d:InsurerId>
And as you can see at https://regex101.com/r/mU6zZ3/1 - you have exactly one match, and it's even "F2021633_V1" :D
For Java, you have to escape the backslashes, so the resulting code would look like this:
Pattern pattern = Pattern.compile("<_\\d:InsurerId>([^<]*)<\\/_\\d:InsurerId>");
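As a sketch of how that pattern might be used when a document holds several InsurerId elements, the loop below collects every captured value. The class name InsurerIdFinder and the method findAll are invented for this example; the regex is the one from the line above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InsurerIdFinder {
    // Same expression as above; group 1 is the text between the two tags.
    private static final Pattern INSURER_ID =
            Pattern.compile("<_\\d:InsurerId>([^<]*)<\\/_\\d:InsurerId>");

    // Collects the content of every InsurerId element found in the input.
    static List<String> findAll(String xml) {
        List<String> ids = new ArrayList<>();
        Matcher m = INSURER_ID.matcher(xml);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(findAll("<_1:InsurerId>F2021633_V1</_1:InsurerId>"));
    }
}
```

Since the pattern has no leading or trailing catch-all groups, find() can simply resume after each match.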
If you are using Java 7 or above, you can use named groups to make the regex a little more readable (also note the \k backreference, which makes the closing tag match the opening tag):
Pattern pattern = Pattern.compile("(?:<(?<InsurancePrefix>.+)InsurerId>)(?<id>[A-Z0-9_]+)</\\k<InsurancePrefix>InsurerId>");
Matcher matcher = pattern.matcher("<_1:InsurerId>F2021633_V1</_1:InsurerId>");
if (matcher.matches()) {
System.out.println(matcher.group("id"));
}
Using back reference the matches() fails, for example, on this text
<_1:InsurerId>F2021633_V1</_2:InsurerId>
which is correct
Javadoc has a good explanation: https://docs.oracle.com/javase/8/docs/api/
Also you might consider using a different tool (XML parser) instead of Regex, as well, as other people have to support your code, and complex Regex is usually difficult to understand.
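A rough sketch of the XML parser route, using the JDK's built-in DOM API, could look like this. Several things here are assumptions for illustration only: the class and method names, the hard-coded _1 prefix, and the wrapping root element with a made-up namespace URI (urn:example) that keeps the snippet well-formed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class InsurerIdDom {
    // Returns the text of the first <_1:InsurerId> element, or null if there is none.
    static String firstInsurerId(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Looks the element up by its qualified tag name, prefix included.
            NodeList nodes = doc.getElementsByTagName("_1:InsurerId");
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The root element and namespace URI are hypothetical.
        String xml = "<root xmlns:_1=\"urn:example\">"
                + "<_1:InsurerId>F2021633_V1</_1:InsurerId></root>";
        System.out.println(firstInsurerId(xml));
    }
}
```

The parser handles whitespace, attributes and nesting on its own, which is most of what the regex versions have to work around.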
I have an ArrayList of links. All links have the same format: abc.([a-z]*)/\\d{4}/
List<String > links= new ArrayList<>();
links.add("abc.com/2012/aa");
links.add("abc.com/2014/dddd");
links.add("abc.in/2012/aa");
I need to get the last portion of every link. ie, the part after domain name. Domain name can be anything(.com, .in, .edu etc).
/2012/aa
/2014/dddd
/2012/aa
This is the output I want. How can I get this using regex?
Thanks
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems.
(see here for background)
Why use regex? Perhaps a simpler solution is to use String.split("/"), which gives you an array of substrings of the original string, split by /. See this question for more info.
Note that String.split() does in fact take a regex to determine the boundaries upon which to split. However you don't need a regex in this case and a simple character specification is sufficient.
Try with below regex and use regex grouping feature that is grouped based on parenthesis ().
\.[a-zA-Z]{2,3}(/.*)
Pattern description :
dot followed by two or three letters followed by forward slash then any characters
Sample code:
Pattern pattern = Pattern.compile("\\.[a-zA-Z]{2,3}(/.*)");
Matcher matcher = pattern.matcher("abc.com/2012/aa");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
output:
/2012/aa
Note:
You can make it more precise by using \\.[a-zA-Z]{2,3}(/\\d{4}/.*) if there are always 4 digits in the pattern.
String result = s.replaceAll("^[^/]*","");
s would be the string in your list.
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
Why not just use the URI class?
output = new URI("http://" + link).getPath(); // the links have no scheme; without one, getPath() returns the whole string
Try this one and use the second capturing group
(.*?)(/.*)
Use foreach loop to iterate over list.
Use substring and indexOf('/').
FOR EXAMPLE
String s="abc.com/2014/dddd";
System.out.println(s.substring(s.indexOf('/')));
OUTPUT
/2014/dddd
Or you can go for split method.
System.out.println(s.split("/",2)[1]);//OUTPUT:2014/dddd --->you need to add /
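Putting the loop and the substring/indexOf idea together, a minimal sketch might look like the following. LinkPaths, pathOf and pathsOf are hypothetical names, and returning an empty string for a link without any slash is an assumption, since the question does not say what should happen in that case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinkPaths {
    // Everything from the first '/' onward; empty string if the link has no '/'.
    static String pathOf(String link) {
        int slash = link.indexOf('/');
        return slash >= 0 ? link.substring(slash) : "";
    }

    // Applies pathOf to every link, preserving order.
    static List<String> pathsOf(List<String> links) {
        List<String> paths = new ArrayList<>();
        for (String link : links) {
            paths.add(pathOf(link));
        }
        return paths;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("abc.com/2012/aa", "abc.com/2014/dddd", "abc.in/2012/aa");
        for (String path : pathsOf(links)) {
            System.out.println(path);
        }
    }
}
```

The indexOf check avoids the StringIndexOutOfBoundsException that substring(-1) would throw for a link with no path.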
I want to match my string to one sequence or another, and it has to match at least one of them.
For AND, I learned it can be done with:
(?=one)(?=other)
Is there something like this for OR?
I am using Java, Matcher and Pattern classes.
Generally speaking about regexes, you definitely should begin your journey into Regex wonderland here: Regex tutorial
What you currently need is the | (pipe character)
To match the strings one OR other, use:
(one|other)
or if you don't want to store the matches, just simply
one|other
To be Java specific, this article is very good at explaining the subject
You will have to use your patterns this way:
//Pattern and Matcher
Pattern compiledPattern = Pattern.compile(myPatternString);
Matcher matcher = compiledPattern.matcher(myStringToMatch);
boolean isNextMatch = matcher.find(); //find the next match, if it exists
if(isNextMatch) {
String matchedString = myStringToMatch.substring(matcher.start(),matcher.end()); //same as matcher.group()
}
Please note, there are many more possibilities with Matcher than what I have shown here...
//String functions
boolean didItMatch = myString.matches(myPatternString); //same as Pattern.matches();
String allReplacedString = myString.replaceAll(myPatternString, replacement);
String firstReplacedString = myString.replaceFirst(myPatternString, replacement);
String[] splitParts = myString.split(myPatternString, howManyPartsAtMost);
Also, I'd highly recommend using online regex checkers such as Regexplanet (Java) or refiddle (this doesn't have Java specific checker), they make your life a lot easier!
The "or" operator is spelled |, for example one|other.
All the operators are listed in the documentation.
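As a quick illustration of the difference it makes whether the alternation has to cover the whole input or just part of it, here is a small sketch. The class and method names are invented for the example; the pattern one|other is the one from the question.

```java
import java.util.regex.Pattern;

class OrExample {
    private static final Pattern ONE_OR_OTHER = Pattern.compile("one|other");

    // True only when the entire input is "one" or "other".
    static boolean isExactly(String s) {
        return ONE_OR_OTHER.matcher(s).matches();
    }

    // True when "one" or "other" occurs anywhere in the input.
    static boolean contains(String s) {
        return ONE_OR_OTHER.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isExactly("other"));
        System.out.println(isExactly("another"));
        System.out.println(contains("another"));
    }
}
```

matches() anchors the pattern to the whole string, while find() looks for the first occurrence, so the same one|other pattern answers two different questions.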
You can separate with a pipe thus:
Pattern.compile("regexp1|regexp2");
See here for a couple of simple examples.
Use the | character for OR
Pattern pat = Pattern.compile("exp1|exp2");
Matcher mat = pat.matcher("Input_data");
The answers are already given, use the pipe '|' operator. In addition to that, it might be useful to test your regexp in a regexp tester without having to run your application, for example:
http://www.regexplanet.com/advanced/java/index.html
Alright folks, my brain is fried. I'm trying to fix up some EMLs with bad boundaries by replacing the incorrect
--Boundary_([ArbitraryName])
lines with more proper
--Boundary_([ArbitraryName])--
lines, while leaving already correct
--Boundary_([ThisOneWasFine])--
lines alone. I've got the whole message in-memory as a String (yes, it's ugly, but JavaMail dies if it tries to parse these), and I'm trying to do a replaceAll on it. Here's the closest I can get.
//Identify boundary lines that do not end in --
String regex = "^--Boundary_\\([^\\)]*\\)$";
Pattern pattern = Pattern.compile(regex,
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
Matcher matcher = pattern.matcher(targetString);
//Store all of our unique results.
HashSet<String> boundaries = new HashSet<String>();
while (matcher.find())
boundaries.add(matcher.group());
//Add "--" at the end of the Strings we found.
for (String boundary : boundaries)
targetString = targetString.replaceAll(Pattern.quote(boundary),
boundary + "--");
This has the obvious problem of replacing all of the valid
--Boundary_([WasValid])--
lines with
--Boundary_([WasValid])----
However, this is the only setup I've gotten to even perform the replacement. If I try changing Pattern.quote(boundary) to Pattern.quote(boundary) + "$", nothing is replaced. If I try just using matcher.replaceAll("$0--") instead of the two loops, nothing is replaced. What's an elegant way to achieve my aim and why does it work?
There's no need to iterate through the matches with find(); that's part of what replaceAll() does.
s = s.replaceAll("(?im)^--Boundary_\\([^\\)]*\\)$", "$0--");
The $0 in the replacement string is a placeholder for whatever the regex matched in this iteration.
The (?im) at the beginning of the regex turns on CASE_INSENSITIVE and MULTILINE modes.
You can try something like this:
String regex = "^--Boundary_\\([^\\)]*\\)(--)?$";
then see if the string ends with -- and replace only ones that don't.
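One possible way to carry out that idea in a single pass is to capture the boundary without its optional trailing -- and always write the -- back. This is a variation sketched for illustration, not necessarily what was intended above; BoundaryFixer and closeBoundaries are hypothetical names.

```java
class BoundaryFixer {
    // Group 1 is the boundary line without the trailing "--".
    // The optional (?:--)? swallows an existing suffix, so "$1--" never doubles it.
    static String closeBoundaries(String message) {
        return message.replaceAll("(?im)^(--Boundary_\\([^)]*\\))(?:--)?$", "$1--");
    }

    public static void main(String[] args) {
        String message = "--Boundary_([ArbitraryName])\n--Boundary_([ThisOneWasFine])--\n";
        System.out.print(closeBoundaries(message));
    }
}
```

Lines that already end in -- are matched and rewritten to the same text, so the call is safe to run more than once on the same message.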
Assuming all the strings are on their own line, this works:
"(?im)^--Boundary_\\([^)]*\\)$"
Example script:
String str = "--Boundary_([ArbitraryName])\n--Boundary_([ArbitraryName])--\n--Boundary_([ArbitraryName])\n--Boundary_([ArbitraryName])--\n";
System.out.println(str.replaceAll("(?im)^--Boundary_\\([^)]*\\)$", "$0--"));
Edit: changed from JavaScript to Java, must have read too fast.(Thanks for pointing it out)