IPv6 address into compressed form in Java

I used Inet6Address.getByName("2001:db8:0:0:0:0:2:1").toString() to compress an IPv6 address, but the output is 2001:db8:0:0:0:0:2:1, whereas I need 2001:db8::2:1. The compressed output should follow the RFC 5952 standard, that is:
Shorten as Much as Possible: For example, 2001:db8:0:0:0:0:2:1 must be shortened to 2001:db8::2:1. Likewise, 2001:db8::0:1 is not acceptable, because the symbol "::" could have been used to produce a shorter representation 2001:db8::1.
Handling One 16-Bit 0 Field: The symbol "::" MUST NOT be used to shorten just one 16-bit 0 field. For example, the representation 2001:db8:0:1:1:1:1:1 is correct, but 2001:db8::1:1:1:1:1 is not correct.
Choice in Placement of "::": When there is an alternative choice in the placement of a "::", the longest run of consecutive 16-bit 0 fields MUST be shortened (i.e., the sequence with three consecutive zero fields is shortened in 2001:0:0:1:0:0:0:1). When the length of the consecutive 16-bit 0 fields are equal (i.e., 2001:db8:0:0:1:0:0:1), the first sequence of zero bits MUST be shortened. For example, 2001:db8::1:0:0:1 is correct representation.
I have also checked another post on Stack Overflow, but it did not address these conditions (for example, the choice in placement of "::").
Is there a Java library that handles this? Could anyone please help me?
Thanks in advance.

How about this?
String resultString = subjectString.replaceAll("((?::0\\b){2,}):?(?!\\S*\\b\\1:0\\b)(\\S*)", "::$2").replaceFirst("^0::","::");
Explanation without Java double-backslash hell:
( # Match and capture in backreference 1:
(?: # Match this group:
:0 # :0
\b # word boundary
){2,} # twice or more
) # End of capturing group 1
:? # Match a : if present (not at the end of the address)
(?! # Now assert that we can't match the following here:
\S* # Any non-space character sequence
\b # word boundary
\1 # the previous match
:0 # followed by another :0
\b # word boundary
) # End of lookahead. This ensures that there is not a longer
# sequence of ":0"s in this address.
(\S*) # Capture the rest of the address in backreference 2.
# This is necessary to jump over any sequences of ":0"s
# that are of the same length as the first one.
Input:
2001:db8:0:0:0:0:2:1
2001:db8:0:1:1:1:1:1
2001:0:0:1:0:0:0:1
2001:db8:0:0:1:0:0:1
2001:db8:0:0:1:0:0:0
Output:
2001:db8::2:1
2001:db8:0:1:1:1:1:1
2001:0:0:1::1
2001:db8::1:0:0:1
2001:db8:0:0:1::
(I hope the last example is correct - or is there another rule if the address ends in 0?)

I recently ran into the same problem and would like to (very slightly) improve on Tim's answer.
The following regular expression offers two advantages:
((?:(?:^|:)0+\\b){2,}):?(?!\\S*\\b\\1:0+\\b)(\\S*)
Firstly, it incorporates the change to match multiple zeroes. Secondly, it also correctly matches addresses where the longest chain of zeroes is at the beginning of the address (such as 0:0:0:0:0:0:0:1).
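A minimal sketch of how this pattern could be wrapped in a helper. The class and method names (Ipv6Regex, compress) are made up for illustration, and the input is assumed to be an uncompressed address with all eight groups present (no "::" yet).

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not part of any library.
final class Ipv6Regex {
    // The pattern from this answer, compiled once and reused.
    private static final Pattern LONGEST_ZERO_RUN = Pattern.compile(
            "((?:(?:^|:)0+\\b){2,}):?(?!\\S*\\b\\1:0+\\b)(\\S*)");

    // Assumes an uncompressed address with all eight groups present.
    static String compress(String address) {
        // Group 2 holds the rest of the address after the collapsed zero run.
        return LONGEST_ZERO_RUN.matcher(address).replaceAll("::$2");
    }
}
```

Calling Ipv6Regex.compress("0:0:0:0:0:0:0:1") should give ::1, while an address with only a single zero group, such as 2001:db8:0:1:1:1:1:1, is left unchanged.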

Guava's InetAddresses class has toAddrString(InetAddress), which formats IPv6 addresses according to RFC 5952.

java-ipv6 is almost what you want. As of version 0.10 it does not check for the longest run of zeroes to shorten with :: - for instance 0:0:1:: is shortened to ::1:0:0:0:0:0. It is a very decent library for the handling of IPv6 addresses, though, and this problem should be fixed with version 0.11, such that the library is RFC 5952 compliant.

The open-source IPAddress Java library can do this. It provides numerous ways of producing strings for IPv4 and/or IPv6, including the canonical string, which for IPv6 matches RFC 5952. Disclaimer: I am the project manager of that library.
Using the examples you list, sample code is:
IPAddress addr = new IPAddressString("2001:db8:0:0:0:0:2:1").getAddress();
System.out.println(addr.toCanonicalString());
// 2001:db8::2:1
addr = new IPAddressString("2001:db8:0:1:1:1:1:1").getAddress();
System.out.println(addr.toCanonicalString());
// 2001:db8:0:1:1:1:1:1
addr = new IPAddressString("2001:0:0:1:0:0:0:1").getAddress();
System.out.println(addr.toCanonicalString());
// 2001:0:0:1::1
addr = new IPAddressString("2001:db8:0:0:1:0:0:1").getAddress();
System.out.println(addr.toCanonicalString());
//2001:db8::1:0:0:1

After performing some tests, I think the following captures all the different IPv6 scenarios:
"((?:(?::0|0:0?)\\b){2,}):?(?!\\S*\\b\\1:0\\b)(\\S*)" -> "::$2"

Not quite elegant but this is my proposal (based on chrixm work):
public static String shortIpv6Form(String fullIP) {
    // strip leading zeros from every group
    fullIP = fullIP.replaceAll("^0{1,3}", "");
    fullIP = fullIP.replaceAll("(:0{1,3})", ":");
    fullIP = fullIP.replaceAll("(0{4}:)", "0:");
    // now we have the full form without unnecessary zeros
    // Ex:
    // 0000:1200:0000:0000:0000:0000:0000:0000 -> 0:1200:0:0:0:0:0:0
    // 0000:0000:0000:1200:0000:0000:0000:8351 -> 0:0:0:1200:0:0:0:8351
    // 0000:125f:0000:94dd:e53f:0000:61a9:0000 -> 0:125f:0:94dd:e53f:0:61a9:0
    // 0000:005f:0000:94dd:0000:cfe7:0000:8351 -> 0:5f:0:94dd:0:cfe7:0:8351
    // compress to short notation
    fullIP = fullIP.replaceAll("((?:(?:^|:)0+\\b){2,}):?(?!\\S*\\b\\1:0+\\b)(\\S*)", "::$2");
    return fullIP;
}
Results:
7469:125f:8eb6:94dd:e53f:cfe7:61a9:8351 -> 7469:125f:8eb6:94dd:e53f:cfe7:61a9:8351
7469:125f:0000:0000:e53f:cfe7:0000:0000 -> 7469:125f::e53f:cfe7:0:0
7469:125f:0000:0000:000f:c000:0000:0000 -> 7469:125f::f:c000:0:0
7469:0000:0000:94dd:0000:0000:0000:8351 -> 7469:0:0:94dd::8351
0469:125f:8eb6:94dd:0000:cfe7:61a9:8351 -> 469:125f:8eb6:94dd:0:cfe7:61a9:8351
0069:125f:8eb6:94dd:0000:cfe7:61a9:8351 -> 69:125f:8eb6:94dd:0:cfe7:61a9:8351
0009:125f:8eb6:94dd:0000:cfe7:61a9:8351 -> 9:125f:8eb6:94dd:0:cfe7:61a9:8351
0000:0000:8eb6:94dd:e53f:0007:6009:8350 -> ::8eb6:94dd:e53f:7:6009:8350
0000:0000:8eb6:94dd:e53f:0007:6009:8300 -> ::8eb6:94dd:e53f:7:6009:8300
0000:0000:8eb6:94dd:e53f:0007:6009:8000 -> ::8eb6:94dd:e53f:7:6009:8000
7469:0000:0000:0000:e53f:0000:0000:8300 -> 7469::e53f:0:0:8300
7009:100f:8eb6:94dd:e000:cfe7:6009:8351 -> 7009:100f:8eb6:94dd:e000:cfe7:6009:8351
7469:100f:8006:900d:e53f:cfe7:61a9:8351 -> 7469:100f:8006:900d:e53f:cfe7:61a9:8351
7000:1200:8e00:94dd:e53f:cfe7:0000:0001 -> 7000:1200:8e00:94dd:e53f:cfe7:0:1
0000:0000:0000:0000:0000:0000:0000:0000 -> ::
0000:0000:0000:94dd:0000:0000:0000:0000 -> 0:0:0:94dd::
0000:1200:0000:0000:0000:0000:0000:0000 -> 0:1200::
0000:0000:0000:1200:0000:0000:0000:8351 -> ::1200:0:0:0:8351
0000:125f:0000:94dd:e53f:0000:61a9:0000 -> 0:125f:0:94dd:e53f:0:61a9:0
7469:0000:8eb6:0000:e53f:0000:61a9:0000 -> 7469:0:8eb6:0:e53f:0:61a9:0
0000:125f:0000:94dd:0000:cfe7:0000:8351 -> 0:125f:0:94dd:0:cfe7:0:8351
0000:025f:0000:94dd:0000:cfe7:0000:8351 -> 0:25f:0:94dd:0:cfe7:0:8351
0000:005f:0000:94dd:0000:cfe7:0000:8351 -> 0:5f:0:94dd:0:cfe7:0:8351
0000:000f:0000:94dd:0000:cfe7:0000:8351 -> 0:f:0:94dd:0:cfe7:0:8351
0000:0000:0000:0000:0000:0000:0000:0001 -> ::1
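For comparison with the regex approaches in this thread, the three RFC 5952 rules quoted in the question can also be sketched without any regular expression. This is only an illustration under stated assumptions: the class name Rfc5952Sketch is hypothetical, and the input must be a full eight-group address (no "::", no embedded IPv4 part, no zone suffix).

```java
// Hypothetical helper; a sketch of the RFC 5952 zero-compression rules.
final class Rfc5952Sketch {
    // Assumes a full eight-group address such as 2001:0db8:0:0:0:0:2:1.
    static String compress(String fullAddress) {
        String[] parts = fullAddress.split(":", -1);
        if (parts.length != 8) {
            throw new IllegalArgumentException("expected 8 groups: " + fullAddress);
        }
        int[] groups = new int[8];
        for (int i = 0; i < 8; i++) {
            groups[i] = Integer.parseInt(parts[i], 16); // parsing drops leading zeros
            if (groups[i] < 0 || groups[i] > 0xffff) {
                throw new IllegalArgumentException("group out of range: " + parts[i]);
            }
        }

        // Find the longest run of zero groups; the strict '>' keeps the first run on a tie.
        int bestStart = -1;
        int bestLength = 0;
        int i = 0;
        while (i < 8) {
            if (groups[i] != 0) {
                i++;
                continue;
            }
            int runStart = i;
            while (i < 8 && groups[i] == 0) {
                i++;
            }
            if (i - runStart > bestLength) {
                bestStart = runStart;
                bestLength = i - runStart;
            }
        }
        if (bestLength < 2) {
            bestStart = -1; // "::" must not replace a single zero group
        }

        StringBuilder out = new StringBuilder();
        for (int g = 0; g < 8; g++) {
            if (g == bestStart) {
                out.append("::");
                g += bestLength - 1; // skip the rest of the collapsed run
                continue;
            }
            if (out.length() > 0 && out.charAt(out.length() - 1) != ':') {
                out.append(':');
            }
            out.append(Integer.toHexString(groups[g])); // lowercase hex, no leading zeros
        }
        return out.toString();
    }
}
```

With this sketch, Rfc5952Sketch.compress("2001:db8:0:0:1:0:0:1") yields 2001:db8::1:0:0:1, because the first of the two equally long zero runs is the one that gets collapsed.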

Related

Searching Strings containing a regex in createCriteria Method

I'm using Grails for my web app project. I know the createCriteria method can perform search on existing entries in database. Let's say I have a domain "some_domain" which includes a string variable "domain_string". I want to find out all "domain_strings" that contain either a 7-digit or 10-digit number starting with "1" or "7". (e.g. domain_string1 = ".........1234567.......", domain_string2 = ".......7192839265......", etc)
In my code:
some_domain.createCriteria().list() {
rlike("domain_string", "%/^(1|7){7,10}/%")
}
I've used java regex here and the grails doc tells me that rlike is for regex input. But I can't get the exact output by the code because I'm not familiar with the groovy syntax. Any suggestions for that? Thanks a lot in advance.
You can use
rlike("domain_string", /([^0-9]|^)[17][0-9]{6}([0-9]{3})?([^0-9]|$)/)
See the regex demo.
Details:
([^0-9]|^) - either a non-digit char or start of string
[17] - 1 or 7
[0-9]{6} - any six digits
([0-9]{3})? - an optional occurrence of three digits
([^0-9]|$) - either a non-digit char or end of string.
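To see what this pattern accepts before handing it to the database, it can be tried in plain Java. This is only a sketch: the class and method names are hypothetical, and the database's own regex dialect may behave differently from java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying the pattern outside the database.
final class DomainStringCheck {
    private static final Pattern SEVEN_OR_TEN_DIGITS = Pattern.compile(
            "([^0-9]|^)[17][0-9]{6}([0-9]{3})?([^0-9]|$)");

    // True if the text contains a 7- or 10-digit number starting with 1 or 7.
    static boolean containsNumber(String text) {
        return SEVEN_OR_TEN_DIGITS.matcher(text).find();
    }
}
```

The two sample strings from the question are accepted, while runs of 8 or 9 digits, or runs starting with another digit, are rejected.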
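With Groovy's native regex support, which follows java.util.regex rules, a check would look like this (note that this pattern only tests for a digit run starting with 1 or 7, not for its length):
def RE = /\D*[17]\d+\D*/
def domain_strings = [ ".........1234567.......", ".......7192839265......", ".......3192839265......", ".......4192839265......" ]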
domain_strings.each{
boolean match = it ==~ RE
println "$it matches? -> $match"
}
prints:
.........1234567....... matches? -> true
.......7192839265...... matches? -> true
.......3192839265...... matches? -> false
.......4192839265...... matches? -> false
You should check whether your database's SQL dialect can consume such expressions as-is.

How to tokenize, scan or split this string of email addresses

For Simple Java Mail I'm trying to deal with a somewhat free-format of delimited email addresses. Note that I'm specifically not validating, just getting the addresses out of a list of addresses. For this use case the addresses can be assumed to be valid.
Here is an example of a valid input:
"name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
So there are two basic forms, "name#domain.com" and "Joe Sixpack <name#domain.com>", which can appear in a comma- or semicolon-delimited string, ignoring white space padding. The problem is that the names can contain delimiters as valid characters.
The following array shows the data needed (trailing spaces or delimiters would not be a big problem):
["name#domain.com",
"Sixpack, Joe 1 <name#domain.com>",
"Sixpack, Joe 2 <name#domain.com>",
"Sixpack, Joe, 3<name#domain.com>",
"nameFoo#domain.com",
"nameBar#domain.com",
"nameBaz#domain.com"]
I can't think of a clean way to deal with this. Any suggestion how I can reliably recognize whether a comma is part of a name or is a delimiter?
Final solution (variation on the accepted answer):
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
// recognize value tails and replace the delimiters there, disambiguating delimiters
const result = string
.replace(/(#.*?>?)\s*[,;]/g, "$1<|>")
.replace(/<\|>$/,"") // remove trailing delimiter
.split(/\s*<\|>\s*/) // split on delimiter including surround space
console.log(result)
Or in Java:
public static String[] extractEmailAddresses(String emailAddressList) {
return emailAddressList
.replaceAll("(#.*?>?)\\s*[,;]", "$1<|>")
.replaceAll("<\\|>$", "")
.split("\\s*<\\|>\\s*");
}
Since you are not validating, I assume that the email addresses are valid.
Based on this assumption, I look for an email address followed by ; or , and treat that delimiter as the end of an entry.
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
const result = string.match(/(.*?#.*?\..*?)[,;]/g)
console.log(result)
This pattern works for your provided examples:
([^#,;\s]+#[^#,;\s]+)|(?:$|\s*[,;])(?:\s*)(.*?)<([^#,;\s]+#[^#,;\s]+)>
([^#,;\s]+#[^#,;\s]+) # email defined by an # with connected chars except ',' ';' and white-space
| # OR
(?:$|\s*[,;])(?:\s*) # start of line OR 0 or more spaces followed by a separator, then 0 or more white-space chars
(.*?) # name
<([^#,;\s]+#[^#,;\s]+)> # email enclosed by lt-gt
PCRE Demo
Using Java's replaceAll and split functions (mimicked in javascript below), I would say lock onto what you know ends an item (the ".com"), replace separator characters with a unique temp (a uuid or something like <|>), and then split using your refactored delimiter.
Here is a JavaScript example, but Java's replaceAll and split can do the same job.
var string = "name#domain.com,Joe Sixpack <name#domain.com>, Sixpack, Joe <name#domain.com> ;Sixpack, Joe<name#domain.com> , name#domain.com,name#domain.com;name#domain.com;"
const result = string.replace(/(\.com>?)[\s,;]+/g, "$1<|>").replace(/<\|>$/,"").split("<|>")
console.log(result)

Matching groups with lookahead expression

I have a problem with matching groups that contain lookahead expressions. I don't know why this expression doesn't work:
"""((?<=^)(.*)(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%))((?<=[\w:]\s)(\w+)(?=\s[cr]))"""
When I compile them separately, for example:
"""(?<=^)(.*)(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)"""
I get the correct result
My sample text:
May 5 23:00:01 10.14.3.10 %ASA-6-302015: Built inbound UDP connection
Expressions have been checked with this tool: http://regex-testdrive.com/en/dotest
My Scala code:
import scala.util.matching.Regex
val text = "May 5 23:00:01 10.14.3.10 %ASA-6-302015: Built inbound UDP connection"
val regex = new Regex("""((?<=^)(.*)(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%))((?<=[\w:]\s)(\w+)(?=\s[cr]))""")
val result = regex.findAllIn(text)
Does anyone know a solution to this problem?
Multiple matching
You may fix the pattern as
^.*?(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)|(?<=[\w:]\s)\w+(?=\s[cr])
See the regex demo. The main point is to introduce the | alternation operator to match either of the 2 subpatterns. Note you do not need to put the ^ start of string anchor into a lookbehind, as ^ is already a zero-width assertion. Also, there are too many groupings that you do not seem to use anyway. Also, to match a literal dot you need to escape it (. -> \.).
To obtain the multiple matches, you may use the following code snippet:
val text = "May 5 23:00:01 10.14.3.10 %ASA-6-302015: Built inbound UDP connection"
val regex = """^.*?(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)|(?<=[\w:]\s)\w+(?=\s[cr])""".r
val result = regex.findAllIn(text)
result.foreach { x => println(x) }
// => May 5 23:00:01
// UDP
See the Scala online demo.
Note that a pattern used with findAllIn is not anchored by default, so you will get all the matches there are in the input string.
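Since scala.util.matching.Regex is built on java.util.regex, the same multiple-match approach can be sketched in plain Java with a Matcher.find() loop. The class and method names below are hypothetical; the pattern is the fixed alternation shown above, with the dots escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper mirroring the findAllIn approach with java.util.regex.
final class LogFieldFinder {
    private static final Pattern FIELDS = Pattern.compile(
            "^.*?(?=\\s\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\s%)"
            + "|(?<=[\\w:]\\s)\\w+(?=\\s[cr])");

    // Collects every non-overlapping match, in order of appearance.
    static List<String> findAll(String line) {
        List<String> matches = new ArrayList<>();
        Matcher m = FIELDS.matcher(line);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }
}
```

For the sample line from the question this returns the timestamp prefix and the protocol name as two separate matches.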
Capturing groups
Another approach you may use is matching the whole line while capturing the necessary bits with capturing groups:
val text = "May 5 23:00:01 10.14.3.10 %ASA-6-302015: Built inbound UDP connection"
val regex = """^(.*?)\s+\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%.*[\w:]\s+(\w+)\s+[cr].*""".r
val results = text match {
case regex(date, protocol) => Array(date, protocol)
case _ => Array[String]()
}
// Demo printing
results.foreach { m =>
println(m)
}
See another Scala demo. Since match requires a full string match, .* is added at the end of the pattern, and only relevant pairs of unescaped (...) are kept in the pattern. See the regex demo here.
Your matches are not next to each other.
Try this:
"""((?<=^)(.*)(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)).*((?<=[\w:]\s)(\w+)(?=\s[cr]))"""
I just added the .* between them; it works in the tool you linked.

Modification of java regex

I was using the below regex to substitute file names
Regex -> .*\/([A-Z0-9_]{1,9})_(O).*.cmd
Substitution -> $1
The file names were like below:
File Name | Substituted Name
---------------------------------- ------------------
/V3/OGM_REC_Offline_Level0_4D.cmd OGM_REC
/V2/PIE_PROD_Online_Level1_6D.cmd PIE_PROD
/V3/BR2_OnDemand.cmd BR2
/opt/STING_Online_Inc0_1W.cmd STING
Then the files changed and I modified the regex
Regex -> .*\/([A-Z0-9_]{1,9})(_O|Full).*.cmd
Substitution -> $1
Additional new file names
File Name | Substituted Name
---------------------- ------------------
/opt/RSU10Full.cmd RSU10
/V4/REZ40_1Full.cmd REZ40_1
Now it seems that new files are arriving with the name formats below:
/app/OMGIT_FullOnDemand_4W.cmd
/admin/FOC_STG_Full_6D.cmd
I've modified the regex again, but without success:
Regex -> .*\/([A-Z0-9_]{1,9})(_O|Full|_Full).*.cmd
Substitution -> $1
I suggest using a version with a lazy limiting quantifier {1,9}? and optional _:
.*/([A-Z0-9_]{1,9}?)(_O|_?Full).*[.]cmd
This way, we match as few characters with [A-Z0-9_]{1,9}? as possible to return a valid captured subtext, and _?Full part can hold the optional underscore.
See the regex demo
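As a quick check, the suggested pattern can be applied with String.replaceAll to the file names listed in the question. The class and method names here are hypothetical; this is a sketch, not a general-purpose parser for such names.

```java
// Hypothetical helper applying the suggested pattern to the sample file names.
final class CommandFileNames {
    // Returns the captured prefix ($1); input that does not match is returned unchanged.
    static String baseName(String path) {
        return path.replaceAll(".*/([A-Z0-9_]{1,9}?)(_O|_?Full).*[.]cmd", "$1");
    }
}
```

For example, CommandFileNames.baseName("/admin/FOC_STG_Full_6D.cmd") gives FOC_STG and CommandFileNames.baseName("/opt/RSU10Full.cmd") gives RSU10. Because the first group is lazy, it stops at the first _O or Full it meets, so a prefix that itself contains _O would be cut short.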
I've noticed that the unnecessary tail always starts with an optional _, followed by an uppercase letter and then a lowercase letter.
So a universal solution is:
.*\/([^a-z]*?)[_]?[A-Z][a-z].*

Regex for almost JSON but not quite

Hello all, I'm trying to parse a pretty well-formed string into its component pieces. The string is very JSON-like, but it's not strictly JSON. The strings are formed like so:
createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source="Region", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
The desired output is just chunks of text; nothing special has to be done with them at this point.
createdAt=Fri Aug 24 09:48:51 EDT 2012
id=238996293417062401
text='Test Test'
source="Region"
entities=[foo, bar]
user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
Using the following expression I am able to get most of the fields separated out
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))(?=(?:[^']*'[^']*')*(?![^']*'))
Which will split on all the commas not in quotes of any type, but I can't seem to make the leap to where it splits on commas not in brackets or braces as well.
Because you want to handle nested parens/brackets, the "right" way to handle them is to tokenize them separately, and keep track of your nesting level. So instead of a single regex, you really need multiple regexes for your different token types.
This is Python, but converting to Java shouldn't be too hard.
import re

# just comma
sep_re = re.compile(r',')
# open paren or open bracket
inc_re = re.compile(r'[\[(]')
# close paren or close bracket
dec_re = re.compile(r'[)\]]')
# string literal
# (I was lazy with the escaping. Add other escape sequences, or find an
# "official" regex to use.)
chunk_re = re.compile(r'''"(?:[^"\\]|\\")*"|'(?:[^'\\]|\\')*[']''')

# This class could've been just a generator function, but I couldn't
# find a way to manage the state in the match function that wasn't
# awkward.
class tokenizer:
    def __init__(self):
        self.pos = 0

    def _match(self, regex, s):
        # Try to match regex at the current position; advance past it on success.
        m = regex.match(s, self.pos)
        if m:
            self.pos += len(m.group(0))
            self.token = m.group(0)
        else:
            self.token = ''
        return self.token

    def tokenize(self, s):
        field = ''  # the field we're working on
        depth = 0   # how many parens/brackets deep we are
        while self.pos < len(s):
            if not depth and self._match(sep_re, s):
                # In Java, change the "yields" to append to a List, and you'll
                # have something roughly equivalent (but non-lazy).
                yield field
                field = ''
            else:
                if self._match(inc_re, s):
                    depth += 1
                elif self._match(dec_re, s):
                    depth -= 1
                elif self._match(chunk_re, s):
                    pass
                else:
                    # everything else we just consume one character at a time
                    self.token = s[self.pos]
                    self.pos += 1
                field += self.token
        yield field
Usage:
>>> list(tokenizer().tokenize('foo=(3,(5+7),8),bar="hello,world",baz'))
['foo=(3,(5+7),8)', 'bar="hello,world"', 'baz']
This implementation takes a few shortcuts:
The string escapes are really lazy: it only supports \" in double quoted strings and \' in single-quoted strings. This is easy to fix.
It only keeps track of nesting level. It does not verify that parens are matched up with parens (rather than brackets). If you care about that you can change depth into some sort of stack and push/pop parens/brackets onto it.
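A rough Java counterpart of the same depth-tracking idea is sketched below. It is not a line-by-line port: it scans characters instead of using token regexes, it also counts braces (which the question's data uses), and it trims each field. The class name TopLevelSplitter is hypothetical, and like the Python version it does not check that bracket types are matched up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits on commas outside quotes and outside (), [], {}.
final class TopLevelSplitter {
    static List<String> split(String s) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        int depth = 0;   // how many parens/brackets/braces deep we are
        char quote = 0;  // the open quote character, or 0 outside a string
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (quote != 0) {
                if (c == '\\' && i + 1 < s.length()) {
                    field.append(c);     // keep the backslash
                    c = s.charAt(++i);   // and take the escaped character verbatim
                } else if (c == quote) {
                    quote = 0;           // closing quote
                }
            } else if (c == '"' || c == '\'') {
                quote = c;               // opening quote
            } else if (c == '(' || c == '[' || c == '{') {
                depth++;
            } else if (c == ')' || c == ']' || c == '}') {
                depth--;
            } else if (c == ',' && depth == 0) {
                fields.add(field.toString().trim()); // top-level separator ends the field
                field.setLength(0);
                continue;
            }
            field.append(c);
        }
        fields.add(field.toString().trim());
        return fields;
    }
}
```

One limitation of this sketch: an apostrophe in unquoted text (for example, it's) would be treated as the start of a quoted string.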
Instead of splitting on the comma, you can use the following regular expression to match the chunks that you want.
(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)
Python:
import re
text = "createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source=\"Region\", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}"
re.findall(r'(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)', text)
>> [
('createdAt', 'Fri Aug 24 09:48:51 EDT 2012'),
('id', '238996293417062401'),
('text', "'Test Test'"),
('source', '"Region"'),
('entities', '[foo, bar]'),
('user', '{name=test, locations=[loc1,loc2], locations={comp1, comp2}}')
]
I've set up grouping so it will separate out the "key" and the "value". It will do the same in Java - See it working in Java here:
http://www.regexplanet.com/cookbook/ahJzfnJlZ2V4cGxhbmV0LWhyZHNyDgsSBlJlY2lwZRj0jzQM/index.html
Regular Expression explained:
(?:^| ) Non-capturing group that matches the beginning of a line, or a space
(.+?) Matches the "key" before the...
= equal sign
(\{.+?\}|\[.+?\]|.+?) Matches either a set of {characters}, [characters], or finally just characters
(?=,|$) Look ahead that matches either a , or the end of a line.
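In Java, the same expression could be used with a Matcher.find() loop, roughly as sketched here. The class and method names are hypothetical, and a LinkedHashMap is just one convenient way to keep the top-level pairs in order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper collecting key/value chunks with the regex from this answer.
final class KeyValueChunks {
    private static final Pattern PAIR = Pattern.compile(
            "(?:^| )(.+?)=(\\{.+?\\}|\\[.+?\\]|.+?)(?=,|$)");

    // Group 1 is the key, group 2 the value; insertion order is preserved.
    static Map<String, String> parse(String text) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            pairs.put(m.group(1), m.group(2));
        }
        return pairs;
    }
}
```

For the sample string from the question this produces the same six key/value pairs as the Python output above.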
