I have a set of files, thousands of files. Let's say these files are named like:
abc a
abc bnd nm
abc vcb
abc
abc something
nmn as
nmn af
nmn bvf
I need to group these files by partial name match. So, in this example I will have 2 groups: the group [abc] and the group [nmn]. Any suggestions?
Edit: Turns out there's a method in String that makes this much easier than regex: String.startsWith(String prefix)
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#startsWith(java.lang.String)
Regex is useful, but overkill for something like this. My bad for overthinking this at first...
(old answer below)
Sounds like a job for regex!
http://docs.oracle.com/javase/7/docs/api/java/util/regex/package-summary.html
String.matches() would help too.
Basically, you'll want to create regexes that match the first section of your file names and then anything else following it. In the example in your question, the regexes would be "abc.*" for abc______ and "nmn.*" for nmn_____. The rules in the link (look at the Pattern class) would give you pretty much all you need.
What you'd do is create two new Sets, one for each prefix. Then loop through the original set, and based on the regex put the file name into one set or the other.
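A minimal sketch of that loop, assuming the file names are already available as strings. The class and method names are made up for illustration, and the regexes use the `abc.*` form because `String.matches()` has to match the whole name:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class RegexGrouping {
    // Collects every name that the regex matches in full
    // (String.matches is implicitly anchored at both ends).
    static Set<String> select(List<String> names, String regex) {
        Set<String> result = new LinkedHashSet<>();
        for (String name : names) {
            if (name.matches(regex)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "abc a", "abc bnd nm", "abc vcb", "abc", "abc something",
                "nmn as", "nmn af", "nmn bvf");
        System.out.println(select(names, "abc.*"));
        System.out.println(select(names, "nmn.*"));
    }
}
```

Each call walks the whole list once, so this is fine for two prefixes but gets wasteful if the number of prefixes grows.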
Create a hash map with a String as the key and, as the value, a list of File (in case you want to store the file objects) or of String (in case you only need the file names).
Then check each name: if it starts with Anti, put it in group 1 under whatever key you like;
if it starts with Philips, put it in group 2 under whatever key you like.
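As a rough sketch of the hash map idea, here is one way it could look with `String.startsWith`. It assumes the prefixes are known up front and uses each prefix as the map key. The names `PrefixGrouping` and `groupByPrefix` are illustrative only, and the sample data comes from the question rather than from this answer:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PrefixGrouping {
    // Puts each name into the list of the first prefix it starts with.
    // Names that match no prefix are left out.
    static Map<String, List<String>> groupByPrefix(List<String> names, List<String> prefixes) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String name : names) {
            for (String prefix : prefixes) {
                if (name.startsWith(prefix)) {
                    groups.computeIfAbsent(prefix, k -> new ArrayList<>()).add(name);
                    break;
                }
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "abc a", "abc bnd nm", "abc vcb", "abc", "abc something",
                "nmn as", "nmn af", "nmn bvf");
        System.out.println(groupByPrefix(names, Arrays.asList("abc", "nmn")));
    }
}
```

To store File objects instead of names, the value type would become `List<File>` and the check would use `file.getName().startsWith(prefix)`.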
I am trying to create a regex in Java to match and get the name, version, channel and owner for each dependency but I haven't been able to have one that covers all the possible scenarios:
the structure is something like name/version#owner/channel, where the version might have a semver structure, the owner and channel are optional.
Currently, I have :
^(?<name>[\d\w][\d\w\+\.-]+)\/(?<version>[\d\w][\d\w\.-]+)(#(?<owner>\w+))?(\/(?<channel>.+))?$
but it's failing for boost_atomic/1.59.0+4#owner/release, since the +4 is not matched and I need the value before that -> 1.59.0
Some other scenarios that need to be valid and are valid for the regex above are:
Poco/1.9.0#pocoproject/stable
zlib/1.2.11#conan/stable
freetype/2.10.1/stable
openssl/1.0.2g/stable
openssl/1.0.2g
openssl/1.0.2g#owner
Also, there might be some dependencies with comments :
zlib/1.2.11#conan/stable # comment
In that case I would need to get rid of the comment and only get the relevant information with the regex.
I am not sure if my current regex is good, but from what I've tested only some scenarios are missing
You can simplify your regex and avoid listing and escaping so many characters in that character set. Instead, use something like [^\/] to capture anything except /, since you want to capture everything preceding a slash.
I've made some modifications, and the updated regex that should work for you is the following:
^(?<name>[^\/]+)\/(?<version>[^\/#\s]+)(#(?<owner>\w+))?(\/(?<channel>\S+))?(?:\s*#\s*(?<comment>.+))?$
I've added another named group for comment as you mentioned that can also be present. Let me know if this works for you.
Edit: If channel contains a text like release:132434 and anything followed by a colon is to be ignored as part of channel, you can use updated regex below,
^(?<name>[^\/]+)\/(?<version>[^\/#\s]+)(?:#(?<owner>\w+))?(?:\/(?<channel>[^:\s]+)\S*)?(?:\s*#\s*(?<comment>.+))?\s*$
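For illustration, this is roughly how the regex above could be used from Java with named groups. The class and method names are hypothetical, and the slashes are written unescaped because Java does not need `\/`. One caveat as far as I can trace it: for boost_atomic/1.59.0+4#owner/release the version group still captures 1.59.0+4, so dropping the +4 would need an extra step.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DependencyParser {
    // Same pattern as above, with backslashes doubled for a Java string literal.
    private static final Pattern DEPENDENCY = Pattern.compile(
            "^(?<name>[^/]+)/(?<version>[^/#\\s]+)(?:#(?<owner>\\w+))?"
            + "(?:/(?<channel>[^:\\s]+)\\S*)?(?:\\s*#\\s*(?<comment>.+))?\\s*$");

    // Returns the captured parts, or null if the line does not match.
    // Optional parts that are absent are stored as null values.
    static Map<String, String> parse(String line) {
        Matcher m = DEPENDENCY.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> parts = new LinkedHashMap<>();
        for (String group : new String[] {"name", "version", "owner", "channel", "comment"}) {
            parts.put(group, m.group(group));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("zlib/1.2.11#conan/stable # comment"));
        System.out.println(parse("openssl/1.0.2g"));
    }
}
```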
I am working on Java project [Maven].
I am confused on one point. I don't know what is logically correct.
The problem is as follows:
A sentence is given, and from it I have to extract some particular words.
Solution that I found
I made one regex and put it in a Constants class. Whenever I have to add more words, I simply append them to the regex.
This solves the problem.
I am confused here
I am thinking of putting a number of text files in the resources folder, where each text file holds one regex expression.
REGEX = (?:A|B|C|D)
A, B, C, D = Word(String)
Is it a good idea ? If not please suggest any other.
Why would you save regexes in a text file? The fact that you're using a regex seems like an implementation detail that you would want to encapsulate (unless you want the significantly greater functionality, but also the overhead, of supporting regexes).
Also, why do you need a new file for each word? It seems like you could just have one file with one word per line, listing all of the words you're interested in. This would be much simpler for a user to understand than 100 files with one regex per file.
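A possible sketch of the one-word-per-line idea: the regex is built in code, so the file only ever contains plain words. The list of words is passed in directly here. In a real project it might come from something like `Files.readAllLines` on a hypothetical keywords.txt. All names are illustrative, and the list is assumed to contain at least one word:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class KeywordExtractor {
    // Builds (?:w1|w2|...) from plain words. Pattern.quote keeps any
    // special characters in a word literal, so users never write regex.
    static Pattern buildPattern(List<String> words) {
        String alternation = words.stream()
                .map(String::trim)
                .filter(w -> !w.isEmpty())
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?:" + alternation + ")");
    }

    // Returns every keyword occurrence in the sentence, in order.
    static List<String> extract(String sentence, Pattern pattern) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(sentence);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        Pattern p = buildPattern(Arrays.asList("city", "plant", "student"));
        System.out.println(extract("The student walked through the city.", p));
    }
}
```

Note that this matches keywords inside longer words too. Wrapping the alternation in `\b` would restrict it to whole words.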
As I understand it, you want to find some keywords in the input string, and those keywords may be extended as your requirements change.
Your current solution is to keep the regex (?:A|B|C|D) in your Constants class and add more keywords to it whenever required.
If my understanding is right, one suggestion is to put this regex in your properties file, like this:
REGEX = (?:city|Animal|plant|student)
If it gets too long, it could be split like this:
REGEX = (?:city|Animal|plant|student|car|computer|clothes|\
furniture|others)
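A small sketch of reading such an entry with `java.util.Properties`. To stay self-contained, the properties text is passed in as a string. In a real project it would more likely be loaded from a classpath resource. The class and method names are illustrative. One thing to keep in mind is that backslashes are escape characters in properties files, so a regex containing `\b` or `\d` would need them doubled there.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Pattern;

class RegexFromProperties {
    // Parses properties-format text and compiles the value stored under REGEX.
    static Pattern load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Pattern.compile(props.getProperty("REGEX"));
    }

    public static void main(String[] args) {
        // A trailing backslash continues the value on the next line;
        // leading whitespace on the continuation line is dropped.
        String text = "REGEX = (?:city|Animal|plant|\\\n        student)\n";
        Pattern p = load(text);
        System.out.println(p.pattern());
        System.out.println(p.matcher("a plant").find());
    }
}
```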
Your second idea, if I understand it correctly, is to use the keywords as file names and put those files in one resource folder, so that you can list the file names to compose the final regex. If your regex always has the fixed (?:A|B|C|D) format, this solution is good and convenient: every time you add a new keyword file, you don't need to modify any source code or properties file.
My question is fairly straightforward, even if the purpose it will serve is pretty complicated. I will use a simple example:
AzzAyyAxxxxByyBzzB
So normally I would want to get everything between A and B. However, because some of the content between the first A and the last B (one pair) contains additional AB pairs I need to push back the end of the match. (Not sure if that last part made sense).
So what I'm looking for is some RegEx that would allow me to have the following output:
Match 1
Group 1: AzzAyyAxxxxByyBzzB
Group 2: zzAyyAxxxxByyBzz
Then I would match it again to get:
Match 2
Group 1: AyyAxxxxByyB
Group 2: yyAxxxxByy
Then finally again to get:
Match 3
Group 1: AxxxxB
Group 2: xxxx
Obviously if I try (A(.*?)B) on the whole input I get:
Match x
Group 1: AzzAyyAxxxxB
Group 2: zzAyyAxxxx
Which is not what I'm looking for :)
I hope this makes sense. I understand if this can't be done in RegEx, but I thought I would ask some of you regex wizards before I give up on it and try something else. Thanks!
Additional Info:
The project I'm working on is written in Java.
One other problem is that I'm parsing a document which could contain something like this:
AzzAyyAxxxxByyBzzB
Here is some unrelated stuff
AzzAyyAxxxxByyBzzB
AzzzBxxArrrBAssssB
And the top AB pairs need to be kept separate from the bottom AB pairs.
You made your regex explicitly ungreedy by using the ?. Just leave it out and the regex will consume as much as possible before matching the B:
(A(.*)B)
However, in general nested structures are beyond the scope of regular expressions. In a case like this:
AxxxByyyAzzzB
You would now also match from the first A to the last B. If this is possible in your scenario, you might be better off going through the string yourself character by character and counting As and Bs to figure out which ones belong together.
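A minimal sketch of that counting approach, assuming A and B only ever appear as delimiters. It keeps a depth counter and emits a span each time the depth returns to zero, so consecutive top-level pairs stay separate. The names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelPairs {
    // Returns each outermost A...B span, treating A as "open" and B as "close".
    // A stray B with no open A is ignored.
    static List<String> topLevelSpans(String input) {
        List<String> spans = new ArrayList<>();
        int depth = 0;
        int start = -1;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == 'A') {
                if (depth == 0) {
                    start = i; // an outermost pair begins here
                }
                depth++;
            } else if (c == 'B' && depth > 0) {
                depth--;
                if (depth == 0) {
                    spans.add(input.substring(start, i + 1));
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        System.out.println(topLevelSpans("AxxxByyyAzzzB"));
        System.out.println(topLevelSpans("AzzAyyAxxxxByyBzzB"));
    }
}
```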
EDIT:
Now that you have updated the question and we figured this out in the comments, you do have the problem of multiple consecutive pairs. In this case, this cannot be done with a regex engine that does not support recursion.
However you can switch to matching from the inside out.
A([^AB]*)B
This will only get innermost pairs, because there can be neither an A nor a B between the delimiters. If you find it, you can then remove the pair and continue with your next match.
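Roughly, the inside-out loop could look like this in Java. It takes "remove the pair" literally and cuts the whole matched A...B out of the string, which means the content reported for an outer pair no longer includes the inner pair. The names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InsideOut {
    private static final Pattern INNERMOST = Pattern.compile("A([^AB]*)B");

    // Repeatedly finds an innermost pair, records its content,
    // and removes that pair before searching again.
    static List<String> peel(String input) {
        List<String> contents = new ArrayList<>();
        String remaining = input;
        Matcher m = INNERMOST.matcher(remaining);
        while (m.find()) {
            contents.add(m.group(1));
            remaining = remaining.substring(0, m.start()) + remaining.substring(m.end());
            m = INNERMOST.matcher(remaining);
        }
        return contents;
    }

    public static void main(String[] args) {
        System.out.println(peel("AzzAyyAxxxxByyBzzB"));
    }
}
```

If the outer content must still contain the inner pair, replacing each match with a placeholder instead of deleting it would be one way to keep that information.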
Use word boundaries, or line anchors if you use multiline mode:
\bA(.*)B\b #for matches that do not span from the beginning of the line to the end
or
^A(.*)B$ #for matches that span from the beginning of the line to the end
You won't be able to do this with regular expressions alone. What you're describing is context-free rather than regular. In order to parse something like this, you need to push a new context onto a stack every time you encounter an 'A' and pop the stack every time you encounter a 'B'. You need something more like a pushdown automaton than a regular expression.
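As a sketch of that push/pop idea in Java, assuming A and B only occur as delimiters: the stack holds the index of every A that is still open, and each B closes the most recent one. Pairs come out innermost first. The names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class NestedPairs {
    // Returns every A...B pair, including the delimiters, in closing order.
    // A B with no matching open A is skipped.
    static List<String> allPairs(String input) {
        List<String> pairs = new ArrayList<>();
        Deque<Integer> openPositions = new ArrayDeque<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == 'A') {
                openPositions.push(i);
            } else if (c == 'B' && !openPositions.isEmpty()) {
                int start = openPositions.pop();
                pairs.add(input.substring(start, i + 1));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String pair : allPairs("AzzAyyAxxxxByyBzzB")) {
            // Group 1 is the pair itself, group 2 the text between the delimiters.
            System.out.println(pair + " -> " + pair.substring(1, pair.length() - 1));
        }
    }
}
```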
I know regular expressions aren't necessarily the best tool for the job, but I was wondering whether this would be possible with Java regexen:
Let's say I have a data set with names separated by line breaks like so:
John Doe
Jane Roe
Richard Miles
(there are naturally a lot more names in the actual system)
I'll be reading in data where I'll get both the first and the last name separately, but they won't necessarily be in the same order.
Now, the question is, is there any way to construct a regex for, say, Richard Miles that would match both "Miles Richard" and "Richard Miles"? I know there are plenty of other ways to do this, but I'm specifically looking for a regex-based solution (not because it's necessarily practical, but I'd find it interesting)
Edit for clarification: what I mean is that I need a regex for, say, "Richard Miles" that will match on both "Richard Miles" and "Miles Richard", and preferably not just (Richard Miles|Miles Richard) because where's the fun in that?
This is in no way supposed to be practical, I'm merely interested whether regexen can do something like this.
Does it need to be complex and clever? I mean this would work.
\b(Miles Richard|Richard Miles)\b
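If the names come from data rather than being hard-coded, the same alternation could be generated per name. A minimal sketch with illustrative method names, using `Pattern.quote` so that any special characters in a name are treated literally:

```java
import java.util.regex.Pattern;

class NameMatcher {
    // Builds \b(last first|first last)\b for the given name parts.
    static Pattern eitherOrder(String first, String last) {
        String f = Pattern.quote(first);
        String l = Pattern.quote(last);
        return Pattern.compile("\\b(" + l + " " + f + "|" + f + " " + l + ")\\b");
    }

    public static void main(String[] args) {
        Pattern p = eitherOrder("Richard", "Miles");
        System.out.println(p.matcher("Miles Richard").matches());
        System.out.println(p.matcher("Richard Miles").matches());
        System.out.println(p.matcher("Richard Doe").matches());
    }
}
```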
Yes, take a look at this: -
^(\\w+)\\s(\\w+)$
It will match a word at the beginning (^\\w+), followed by a whitespace character (\\s), followed by another word at the end (\\w+$).
Are you looking for this only?
I have 100 words. All 100 entries look like this:
EnglishWord,EngMeaning,NumberofW… meaning,31
From that I want to retrieve only the EnglishWord (e.g. Friendship) for each of the 100 entries, using a Java program.
I am assuming you have a "body" (main string), containing a list of substrings and you want to retrieve any specific one substring from within.
This looks a lot like homework/exercise, so I'll avoid giving you a ready-to-roll answer, since you need to achieve a solution yourself for it to be of any value, but the general steps you will need are the following:
1:
Be able to separate each substring (entry) from the others (the base string) in an organized fashion.
This can be done (for the string case), as @kylc said, with String's split function, which uses a regex (Pattern) to define one or more delimiters that are then used to divide the string into an array of substrings.
String[] arrayOfEntries /*something to hold the result*/ = yourStringVar.split("," /*your split regex pattern*/);
NOTE: For more information on these, here are the links: String's split function, Pattern.
2:
Be able to acquire any specific entry within an array of entries.
This is best done with a function you can reuse for other work. You need to define a "target" (what is going to be acquired) and a "source" (the group of entries to acquire the "target" from).
All you have to do is loop over the "source" and compare each entry there to the "target"; when a match is found, just return it.
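As a bare-bones sketch of step 2 only, such a function might look like the following. The function name and the sample entry are made up for illustration, and exact string equality is just one possible way to define a match:

```java
class EntryFinder {
    // Loops over the source entries and returns the first one equal to target,
    // or null when nothing matches.
    static String findEntry(String[] source, String target) {
        for (String entry : source) {
            if (entry.equals(target)) {
                return entry;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical entry, split on commas as in step 1.
        String[] entries = "Friendship,companionship,31".split(",");
        System.out.println(findEntry(entries, "Friendship"));
    }
}
```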
That's it! The rest is up to you!