Extracting data from a text file - repeated values - java

79 0009!017009!0479%0009!0479 0009!0469%0009!0469
0009!0459%0009!0459'009 0009!0459%0009!0449 0009!0449%0009!0449
0009!0439%0009!0439 0009!0429%0009!0429'009 0009!0429%0009!0419
0009!0419%0009!0409 000'009!0399 0009!0389%0009!0389'009
0009!0379%0009!0369 0009!0349%0009!0349 0009!0339%0009!0339
0009!0339%0009!0329'009 0009!0329%0009!0329 0009!032
In this data, I'm supposed to extract the number 47, 46 , 45 , 44 and so on. I´m supposed to avoid the rest. The numbers always follow this flow - 9!0 no 9%
for example: 9!0 42 9%
Which language should I go about to solve this and which function might help me?
Is there any function that can position a special character and copy the next two or three elements?
Ex: 9!0 42 9% and ' 009
look out for ! and then copy 42 from there and look out for ' that refers to another value (009). It's like two different regex to be used.

You can use whatever language you want, or even a unix command line utility like sed, awk, or grep. The regex should be something like this - you want to match 9!0 followed by digits followed by 0%. Use this regex: 9!0(\d+)0% (or if the numbers are all two digits, 9!0(\d{2})0%).

The other answers are fine, my regex solution is simply "9!.(\d\d)"
And here's a full solution in powershell, which can be easily correlated to other .net langs
$t="79 0009!017009!0479%0009!0479 0009!0469%0009!0469 0009!0459%0009!0459'009 0009!0459%0009!0449 0009!0449%0009!0449 0009!0439%0009!0439 0009!0429%0009!0429'009 0009!0429%0009!0419 0009!0419%0009!0409 000'009!0399 0009!0389%0009!0389'009 0009!0379%0009!0369 0009!0349%0009!0349 0009!0339%0009!0339 0009!0339%0009!0329'009 0009!0329%0009!0329 0009!032"
$p="9!.(\d\d)"
$ms=[regex]::match($t,$p)
while ($ms.Success) {write-host $ms.groups[1].value;$ms=$ms.NextMatch()}

This is perl:
#result = $subject =~ m/(?<=9!0)\d+(?=9%)/g;
It will give you an array of all your numbers. You didn't provide a language so I don't know if this is suitable for you or not.
Pattern regex = Pattern.compile("(?<=9!0)\\d+(?=9%)");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
// match start: regexMatcher.start()
// match end: regexMatcher.end()
}

Related

Searching Strings containing a regex in createCriteria Method

I'm using Grails for my web app project. I know the createCriteria method can perform search on existing entries in database. Let's say I have a domain "some_domain" which includes a string variable "domain_string". I want to find out all "domain_strings" that contain either a 7-digit or 10-digit number starting with "1" or "7". (e.g. domain_string1 = ".........1234567.......", domain_string2 = ".......7192839265......", etc)
In my code:
some_domain.createCriteria().list() {
rlike("domain_string", "%/^(1|7){7,10}/%")
}
I've used java regex here and the grails doc tells me that rlike is for regex input. But I can't get the exact output by the code because I'm not familiar with the groovy syntax. Any suggestions for that? Thanks a lot in advance.
You can use
rlike("domain_string", /([^0-9]|^)[17][0-9]{6}([0-9]{3})?([^0-9]|$)/)
See the regex demo.
Details:
([^0-9]|^) - either a non-digit char or start of string
[17] - 1 or 7
[0-9]{6} - any six digits
([0-9]{3})? - an optional occurrence of three digits
([^0-9]|$) - either a non-digit char or end of string.
Groovy regex by java native rules would look like:
def RE = /\D*[17]\d+\D*/
def domain_strings = [ ".........1234567.......", ".......7192839265......", ".......3192839265......", , ".......4192839265......" ]
domain_strings.each{
boolean match = it ==~ RE
println "$it matches? -> $match"
}
prints:
.........1234567....... matches? -> true
.......7192839265...... matches? -> true
.......3192839265...... matches? -> false
.......4192839265...... matches? -> false
You should check your DB SQL dialect if can consume such expressions as-is.

How to tokenize, scan or split this string of email addresses

For Simple Java Mail I'm trying to deal with a somewhat free-format of delimited email addresses. Note that I'm specifically not validating, just getting the addresses out of a list of addresses. For this use case the addresses can be assumed to be valid.
Here is an example of a valid input:
"name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
So there are two basic forms "name#domain.com" and "Joe Sixpack ", which can appear in a comma / semicolon delimited string, ignoring white space padding. The problem is that the names can contains delimiters as valid characters.
The following array shows the data needed (trailing spaces or delimiters would not be a big problem):
["name#domain.com",
"Sixpack, Joe 1 <name#domain.com>",
"Sixpack, Joe 2 <name#domain.com>",
"Sixpack, Joe, 3<name#domain.com>",
"nameFoo#domain.com",
"nameBar#domain.com",
"nameBaz#domain.com"]
I can't think of a clean way to deal with this. Any suggestion how I can reliably recognize whether a comma is part of a name or is a delimiter?
Final solution (variation on the accepted answer):
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
// recognize value tails and replace the delimiters there, disambiguating delimiters
const result = string
.replace(/(#.*?>?)\s*[,;]/g, "$1<|>")
.replace(/<\|>$/,"") // remove trailing delimiter
.split(/\s*<\|>\s*/) // split on delimiter including surround space
console.log(result)
Or in Java:
public static String[] extractEmailAddresses(String emailAddressList) {
return emailAddressList
.replaceAll("(#.*?>?)\\s*[,;]", "$1<|>")
.replaceAll("<\\|>$", "")
.split("\\s*<\\|>\\s*");
}
since you are not validating, i assume that the email addresses are valid.
Based on this assumption, i will look up an email address followed by ; or , this way i know its valid.
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
const result = string.match(/(.*?#.*?\..*?)[,;]/g)
console.log(result)
This pattern works for your provided examples:
([^#,;\s]+#[^#,;\s]+)|(?:$|\s*[,;])(?:\s*)(.*?)<([^#,;\s]+#[^#,;\s]+)>
([^#,;\s]+#[^#,;\s]+) # email defined by an # with connected chars except ',' ';' and white-space
| # OR
(?:$|\s*[,;])(?:\s*) # start of line OR 0 or more spaces followed by a separator, then 0 or more white-space chars
(.*?) # name
<([^#,;\s]+#[^#,;\s]+)> # email enclosed by lt-gt
PCRE Demo
Using Java's replaceAll and split functions (mimicked in javascript below), I would say lock onto what you know ends an item (the ".com"), replace separator characters with a unique temp (a uuid or something like <|>), and then split using your refactored delimiter.
Here is a javascript example, but Java's repalceAll and split can do the same job.
var string = "name#domain.com,Joe Sixpack <name#domain.com>, Sixpack, Joe <name#domain.com> ;Sixpack, Joe<name#domain.com> , name#domain.com,name#domain.com;name#domain.com;"
const result = string.replace(/(\.com>?)[\s,;]+/g, "$1<|>").replace(/<\|>$/,"").split("<|>")
console.log(result)

Regex for almost JSON but not quite

Hello all I'm trying to parse out a pretty well formed string into it's component pieces. The string is very JSON like but it's not JSON strictly speaking. They're formed like so:
createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source="Region", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
With output just as chunks of text nothing special has to be done at this point.
createdAt=Fri Aug 24 09:48:51 EDT 2012
id=238996293417062401
text='Test Test'
source="Region"
entities=[foo, bar]
user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
Using the following expression I am able to get most of the fields separated out
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))(?=(?:[^']*'[^']*')*(?![^']*'))
Which will split on all the commas not in quotes of any type, but I can't seem to make the leap to where it splits on commas not in brackets or braces as well.
Because you want to handle nested parens/brackets, the "right" way to handle them is to tokenize them separately, and keep track of your nesting level. So instead of a single regex, you really need multiple regexes for your different token types.
This is Python, but converting to Java shouldn't be too hard.
# just comma
sep_re = re.compile(r',')
# open paren or open bracket
inc_re = re.compile(r'[[(]')
# close paren or close bracket
dec_re = re.compile(r'[)\]]')
# string literal
# (I was lazy with the escaping. Add other escape sequences, or find an
# "official" regex to use.)
chunk_re = re.compile(r'''"(?:[^"\\]|\\")*"|'(?:[^'\\]|\\')*[']''')
# This class could've been just a generator function, but I couldn;'t
# find a way to manage the state in the match function that wasn't
# awkward.
class tokenizer:
def __init__(self):
self.pos = 0
def _match(self, regex, s):
m = regex.match(s, self.pos)
if m:
self.pos += len(m.group(0))
self.token = m.group(0)
else:
self.token = ''
return self.token
def tokenize(self, s):
field = '' # the field we're working on
depth = 0 # how many parens/brackets deep we are
while self.pos < len(s):
if not depth and self._match(sep_re, s):
# In Java, change the "yields" to append to a List, and you'll
# have something roughly equivalent (but non-lazy).
yield field
field = ''
else:
if self._match(inc_re, s):
depth += 1
elif self._match(dec_re, s):
depth -= 1
elif self._match(chunk_re, s):
pass
else:
# everything else we just consume one character at a time
self.token = s[self.pos]
self.pos += 1
field += self.token
yield field
Usage:
>>> list(tokenizer().tokenize('foo=(3,(5+7),8),bar="hello,world",baz'))
['foo=(3,(5+7),8)', 'bar="hello,world"', 'baz']
This implementation takes a few shortcuts:
The string escapes are really lazy: it only supports \" in double quoted strings and \' in single-quoted strings. This is easy to fix.
It only keeps track of nesting level. It does not verify that parens are matched up with parens (rather than brackets). If you care about that you can change depth into some sort of stack and push/pop parens/brackets onto it.
Instead of splitting on the comma, you can use the following regular expression to match the chunks that you want.
(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)
Python:
import re
text = "createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source=\"Region\", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}"
re.findall(r'(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)', text)
>> [
('createdAt', 'Fri Aug 24 09:48:51 EDT 2012'),
('id', '238996293417062401'),
('text', "'Test Test'"),
('source', '"Region"'),
('entities', '[foo, bar]'),
('user', '{name=test, locations=[loc1,loc2], locations={comp1, comp2}}')
]
I've set up grouping so it will separate out the "key" and the "value". It will do the same in Java - See it working in Java here:
http://www.regexplanet.com/cookbook/ahJzfnJlZ2V4cGxhbmV0LWhyZHNyDgsSBlJlY2lwZRj0jzQM/index.html
Regular Expression explained:
(?:^| ) Non-capturing group that matches the beginning of a line, or a space
(.+?) Matches the "key" before the...
= equal sign
(\{.+?\}|\[.+?\]|.+?) Matches either a set of {characters}, [characters], or finally just characters
(?=,|$) Look ahead that matches either a , or the end of a line.

Regex: strip all tags except those containing keyword "univ"

[introduction][position]Lead Researcher and Research Manager[/position] in the [affiliation]Web Search and Mining Group, Microsoft Research[/affiliation]</b>.
I am a [position]lead researcher[/position] at [affiliation]Microsoft Research[/affiliation]. I am also [position]adjunct professor[/position] of [affiliation]Peking University[/affiliation], [affiliation]Xian Jiaotong University[/affiliation] and [affiliation]Nankai University[/affiliation].
I joined [affiliation]Microsoft Research[/affiliation] in June 2001. Prior to that, I worked at the Research Laboratories of NEC Corporation.
I obtained a [bsdegree]B.S.[/bsdegree] in [bsmajor]Electrical Engineering[/bsmajor] from [bsuniv]Kyoto University[/bsuniv] in [bsdate]1988[/bsdate] and a [msdegree]M.S.[/msdegree] in [msmajor]Computer Science[/msmajor] from [msuniv]Kyoto University[/msuniv] in [msdate]1990[/msdate]. I earned my [phddegree]Ph.D.[/phddegree] in [phdmajor]Computer Science[/phdmajor] from the [phduniv]University of Tokyo[/phduniv] in [phddate]1998[/phddate].
I am interested in [interests]statistical learning[/interests], [interests]natural language processing[/interests], [interests]data mining, and information retrieval[/interests].[/introduction]
I'm able to strip all tags from the paragraph above with:
String stripped = html.replaceAll("\\[.*?\\]", "");
But I'd like to keep three pairs of tags in the paragraph, which are [bsuniv][/bsuniv],[msuniv][/msuniv] and [phduniv][/phduniv]. In other words, I don't want to strip those tags containing the keyword "univ". I can't find a convenient way to rewrite the regular expression. Anyone help me?
You can use a negative-look ahead assertion here: -
str = str.replaceAll("\\[(.(?!univ))*?\\]", "");
or: -
str = str.replaceAll("\\[((?!univ).)*?\\]", "");
Both of them will give you the desired output. There is only one difference -
The first one does a negative look-ahead, against the current character, and if it is not followed by univ, it moves to the next character.
The second one does a negative look-ahead against an empty string before every character, and if it is not followed by univ, it goes ahead to match a single character.

Bug in java.util.regex in sun jdk 6.0.24?

The following code blocks on my system. Why?
System.out.println( Pattern.compile(
"^((?:[^'\"][^'\"]*|\"[^\"]*\"|'[^']*')*)/\\*.*?\\*/(.*)$",
Pattern.MULTILINE | Pattern.DOTALL ).matcher(
"\n\n\n\n\n\nUPDATE \"$SCHEMA\" SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';"
).matches() );
The pattern (designed to detect comments of the form /*...*/ but not within ' or ") should be fast, as it is deterministic...
Why does it take soooo long?
You're running into catastrophic backtracking.
Looking at your regex, it's easy to see how .*? and (.*) can match the same content since both also can match the intervening \*/ part (dot matches all, remember). Plus (and even more problematic), they can also match the same stuff that ((?:[^'"][^'"]*|"[^"]*"|'[^']*')*) matches.
The regex engine gets bogged down in trying all the permutations, especially if the string you're testing against is long.
I've just checked your regex against your string in RegexBuddy. It aborts the match attempt after 1.000.000 steps of the regex engine. Java will keep churning on until it gets through all permutations or until a Stack Overflow occurs...
You can greatly improve the performance of your regex by prohibiting backtracking into stuff that has already been matched. You can use atomic groups for this, changing your regex into
^((?>[^'"]+|"[^"]*"|'[^']*')*)(?>/\*.*?\*/)(.*)$
or, as a Java string:
"^((?>[^'\"]+|\"[^\"]*\"|'[^']*')*)(?>/\\*.*?\\*/)(.*)$"
This reduces the number of steps the regex engine has to go through from > 1 million to 58.
Be advised though that this will only find the first occurrence of a comment, so you'll have to apply the regex repeatedly until it fails.
Edit: I just added two slashes that were important for the expressions to work. Yet I had to change more than 6 characters.... :(
I recommend that you read Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...).
I think it's because of this bit:
(?:[^'\"][^'\"]*|\"[^\"]*\"|'[^']*')*
Removing the second and third alternatives gives you:
(?:[^'\"][^'\"]*)*
or:
(?:[^'\"]+)*
Repeated repeats can take a long time.
For comment /* and */ detection I would suggest having a code like this:
String str = "\n\n\n\n\n\nUPDATE \"$SCHEMA\" /*a comment\n\n*/ SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';";
Pattern pt = Pattern.compile("\"[^\"]*\"|'[^']*'|(/\\*.*?\\*/)",
Pattern.MULTILINE | Pattern.DOTALL);
Matcher matcher = pt.matcher(str);
boolean found = false;
while (matcher.find()) {
if (matcher.group(1) != null) {
found = true;
break;
}
}
if (found)
System.out.println("Found Comment: [" + matcher.group(1) + ']');
else
System.out.println("Didn't find Comment");
For above string it prints:
Found Comment: [/*a comment
*/]
But if I change input string to:
String str = "\n\n\n\n\n\nUPDATE \"$SCHEMA\" '/*a comment\n\n*/' SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';";
OR
String str = "\n\n\n\n\n\nUPDATE \"$SCHEMA\" \"/*a comment\n\n*/\" SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';";
Output is:
Didn't find Comment

Categories