Renaming .fromFilePairs with regex capture group in closure - java

I'm new to Nextflow/Groovy/Java and I'm running into some difficulty with a simple regular expression task.
I'm trying to alter the labels of some file pairs.
It is my understanding that fromFilePairs returns a data structure of the form:
[
[common_prefix, [file1, file2]],
[common_prefix, [file3, file4]]
]
I further thought that:
The .name method, when invoked on an item from this list, gives the name, which I have labelled above as common_prefix.
The value returned by a closure used with fromFilePairs sets the names of the file pairs.
The value of it in a closure used with fromFilePairs is a single item from the list of file pairs.
However, I have tried many variants on the following without success:
params.fastq = "$baseDir/data/fastqs/*_{1,2}_*.fq.gz"
Channel
.fromFilePairs(params.fastq, checkIfExists:true) {
file ->
// println file.name // returned the common file prefix as I expected
mt = file.name =~ /(common)_(prefix)/
// println mt
// # java.util.regex.Matcher[pattern=(common)_(prefix) region=0,47 lastmatch=]
// match objects appear empty despite testing with regexs I know to work correctly including simple stuff like (.*) to rule out issues with my regex
// println mt.group(0) // #No match found
mt.group(0) // or a composition like mt.group(0) + "-" + mt.group(1)
}
.view()
I've also tried some variant on this using the replaceAll method.
I've consulted the documentation for Nextflow, Groovy and Java and I still can't figure out what I'm missing. I expect it's some stupid syntactic thing or a misunderstanding of the data structure, but I'm tired of banging my head against it when it's probably obvious to someone who knows the language better. I'd appreciate it if anyone could enlighten me on how this works.

A closure can be provided to fromFilePairs to implement a custom file pair grouping strategy. It takes a file and should return the grouping key. The example in the docs just groups the files by their file extensions:
Channel
.fromFilePairs('/some/data/*', size: -1) { file -> file.extension }
.view { ext, files -> "Files with the extension $ext are $files" }
This isn't necessary if all you want to do is alter the labels of some file pairs. You can use the map operator for this. The fromFilePairs op emits tuples in which the first element is the 'grouping key' of the matching pair and the second element is the 'list of files' (sorted lexicographically):
Channel
.fromFilePairs(params.fastq, checkIfExists:true) \
.map { group_key, files ->
tuple( group_key.replaceAll(/common_prefix/, ""), files )
} \
.view()
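As a side note on the "No match found" message in the question: Groovy's =~ operator produces a java.util.regex.Matcher, and a Matcher only has groups to return after a match attempt such as find() or matches() has succeeded. The plain Java sketch below illustrates this; the class name, the relabel helper, the regex and the sample file name are all made up for illustration and are not taken from the question's data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherDemo {
    // Hypothetical pattern: a literal prefix, an underscore, then digits.
    private static final Pattern LABEL = Pattern.compile("(sample)_(\\d+)");

    // Returns "<group1>-<group2>" when the pattern is found, else the name unchanged.
    static String relabel(String name) {
        Matcher m = LABEL.matcher(name);
        // group() is only valid after a successful find()/matches();
        // calling it earlier throws IllegalStateException ("No match found").
        if (!m.find()) {
            return name;
        }
        return m.group(1) + "-" + m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(relabel("sample_01_L001"));
    }
}
```

In Groovy the same idea applies: call mt.find() (or test the matcher in a boolean context) before reading mt.group(...).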

Related

Custom SnakeYAML dump styles

I want to use custom dump styles in different cases. For example, I have this sample code:
DumperOptions options = new DumperOptions();
options.setDefaultFlowStyle(DumperOptions.FlowStyle.BLOCK);
options.setDefaultScalarStyle(DumperOptions.ScalarStyle.PLAIN);
Yaml yaml = new Yaml(options);
Map<Object, Object> map = new LinkedHashMap<>();
map.put("list", new ArrayList<>(Arrays.asList("entry1", "entry2")));
map.put("multiline", "line 1\nline 2\nline 3");
map.put("oneline", "line");
map.put("oneline-special", "line with #");
map.put("oneline-special #", "line with #");
yaml.dump(map, fileWriter);
Dump result is:
list:
- entry1
- entry2
multiline: |-
line 1
line 2
line 3
oneline: line
oneline-special: 'line with #'
'oneline-special #': 'line with #'
Problem:
I want string values to be double quoted in every case, i.e. key: "value", and keys to be double quoted only when quoting is needed, i.e. "key": "value". I also need to keep DumperOptions.ScalarStyle.PLAIN so that multiline strings are still output in the pretty block style.
I searched for anything related and found a little information about extending Representer, but that does not seem to solve my problem of explicit styles (no quotes on the key, double quotes on the value). I thought about extending Emitter, but it is a final class, so I can't use it without rewriting part of the library.
So, my final result should be:
list:
- "entry1"
- "entry2"
multiline: |-
line 1
line 2
line 3
oneline: "line"
oneline-special: "line with #"
"oneline-special #": "line with #"
number: 512
Any solutions? Need your help. Thanks in advance.
As no other solution was provided, I solved it by directly changing the processScalar() method in the Emitter class. First I added a check to force double quoting if the scalar is not a key and not multiline (because I want plain style for multiline strings):
if (!simpleKeyContext && !analysis.multiline) {
style = ScalarStyle.DOUBLE_QUOTED;
}
Then I changed the switch logic so that a scalar with the SINGLE_QUOTED ScalarStyle is written as double quoted; that way a key is written in double quoted style whenever it needs quoting.
I ran JUnit tests with a simple key/value pair and different styles, a multiline case and a list case. Everything works as expected.
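The decision rule described above can be sketched in isolation. This is not SnakeYAML code: the Style enum, the class and the effectiveStyle method are hypothetical stand-ins that only model the two changes (force double quotes for non-key, non-multiline scalars, and write anything that would be single quoted as double quoted).

```java
class ScalarStyleRule {
    // Hypothetical stand-in for the emitter's scalar styles.
    enum Style { PLAIN, SINGLE_QUOTED, DOUBLE_QUOTED, LITERAL }

    // chosen: the style the emitter would have picked on its own.
    static Style effectiveStyle(Style chosen, boolean isKey, boolean multiline) {
        Style style = chosen;
        // Change 1: values that are not multiline are always double quoted.
        if (!isKey && !multiline) {
            style = Style.DOUBLE_QUOTED;
        }
        // Change 2: whatever would be single quoted (e.g. a key containing '#')
        // is written double quoted instead.
        if (style == Style.SINGLE_QUOTED) {
            style = Style.DOUBLE_QUOTED;
        }
        return style;
    }
}
```

Under this model, a plain key stays plain, a key that needs quoting becomes double quoted, and a multiline value keeps its block style. As written, a non-string value such as a number would also be quoted, so the real patch presumably needs an extra condition for that case.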

Regarding a data structure for O(1) get on prefixes

So I am trying to write a little utility in Scala that constantly listens on a bunch of directories for file system changes (deletes, creates, modifications etc) and rsyncs it immediately across to a remote server. (https://github.com/Khalian/LockStep)
My configuration is stored in JSON as follows:
{
"localToRemoteDirectories": {
"/workplace/arunavs/third_party": {
"remoteDir": "/remoteworkplace/arunavs/third_party",
"remoteServerAddr": "some Remote server address"
}
}
}
This configuration is stored in a Scala Map (key = localDir, value = (remoteDir, remoteServerAddr)). The tuple is represented as a case class
sealed case class RemoteLocation(remoteDir:String, remoteServerAddr:String)
I am using a third-party actor
(https://github.com/lloydmeta/schwatcher/blob/master/src/main/scala/com/beachape/filemanagement/FileSystemWatchMessageForwardingActor.scala)
that listens on these directories (e.g. /workplace/arunavs/third_party) and then outputs a Java 7 WatchEvent.Kind event (ENTRY_CREATE, ENTRY_MODIFY, etc.). The problem is that the events carry absolute paths (for instance, if I create a file helloworld in the third_party dir, the message sent by the actor is (ENTRY_CREATE, /workplace/arunavs/third_party/helloworld)).
I need a way to write a getter that gets the nearest prefix from the configuration map stored above. The obvious way to do it is to filter on the map:
def getRootDirsAndRemoteAddrs(localDir:String) : Map[String, RemoteLocation] =
localToRemoteDirectories.filter(e => localDir.startsWith(e._1))
This simply returns the subset of keys that are a prefix of localDir (in the above example, this method is called with localDir = /workplace/arunavs/third_party/helloworld). While this works, the implementation is O(n), where n is the number of items in my configuration. I am looking for better computational complexity (I looked at radix and Patricia tries, but they don't cut it, since I am feeding in a string and trying to get the keys that are prefixes of it; tries solve the opposite problem).
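One possible direction, which is not from the original post: since the keys are directories, the lookup can walk up the ancestors of the changed path and probe the hash map at each level. The cost then depends on the depth of the path rather than on the number of configured directories. The sketch below is in Java rather than Scala; the class and method names are hypothetical, a plain Map<String, String> stands in for the map of RemoteLocation values, and paths are assumed to be '/'-separated and already normalized.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class NearestPrefixLookup {
    // Returns the deepest configured directory that contains changedPath, if any.
    static Optional<String> nearestConfiguredDir(Map<String, String> config, String changedPath) {
        String candidate = changedPath;
        while (!candidate.isEmpty()) {
            if (config.containsKey(candidate)) {
                return Optional.of(candidate);   // first hit is the nearest ancestor
            }
            int slash = candidate.lastIndexOf('/');
            if (slash < 0) {
                break;                           // no parent left to try
            }
            candidate = candidate.substring(0, slash); // move up one directory
        }
        return Optional.empty();                 // note: a bare "/" key is never probed
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("/workplace/arunavs/third_party", "/remoteworkplace/arunavs/third_party");
        System.out.println(nearestConfiguredDir(config, "/workplace/arunavs/third_party/helloworld"));
    }
}
```

Unlike a startsWith filter, this only matches whole path components, so a sibling such as /workplace/arunavs/third_party_old would not be treated as living under /workplace/arunavs/third_party.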

Hashmap single key holding a class. count the key and retrieve counter

I am working on a personal database project. I have an input file obtained from: http://ir.dcs.gla.ac.uk/resources/test_collections/cran/
After splitting it into 1400 separate files (named 00001.txt, ..., 01400.txt) and applying stemming to them, I store them in a specific folder, let's call it StemmedFolder, in the following format:
in StemmedFolder: 00001.txt includes:
investig
aerodynam
wing
slipstream
brenckman
experiment
investig
aerodynam
wing
in StemmedFolder: 00756.txt includes:
remark
eddi
viscos
compress
mix
flow
lu
ting
And so on....
I wrote code that does the following:
reads the StemmedFolder and counts the unique words
sorts them alphabetically
adds the ID of the document
saves each result to a new file, 00001.txt to 01400.txt, as described below
(I can provide my code for these 4 steps in case somebody needs to see the implementation or wants to change or edit it.)
The output for each file is written to a separate file (1400 files, named 00001.txt, 00002.txt, ...) in a specific folder, let's call it FrequenceyFolder, in the following format:
in FrequenceyFolder: 00001.txt includes:
00001,aerodynam,2
00001,agre,3
00001,angl,1
00001,attack,7
00001,basi,4
....
in FrequenceyFolder: 00999.txt includes:
00999,aerodynam,5
00999,evalu,1
00999,lift,3
00999,ratio,2
00999,result,9
....
in FrequenceyFolder: 01400.txt includes:
01400,subtract,1
01400,support,1
01400,theoret,1
01400,theori,1
01400,.....
______________
Now my question:
I need to combine these 1400 files again to output a txt file that looks like this format with some calculation:
'aerodynam' totalFrequency=3docs: [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
'book' totalFrequency=2docs: [[Doc_00562,6],[Doc_01111,1]]
....
....
'result' totalFrequency=1doc: [[Doc_00010,5]]
....
....
'zzzz' totalFrequency=1doc: [[Doc_01235,1]]
Thanks for spending time reading this long post
You can use a Map of List.
Map<String,List<FileInformation>> statistics = new HashMap<>();
In the above map, the key will be the word and the value will be a List<FileInformation> object describing the statistics of individual files containing the word. The FileInformation class can be declared as follows :
class FileInformation {
int occurrenceCount;
String fileName;
//getters and setters
}
To populate the above Map, use the following steps :
1. Read each file in the FrequencyFolder.
2. When you come across a word for the first time, put it as a key in the Map.
3. Create a FileInformation object, set the occurrenceCount to the number of occurrences found and set the fileName to the name of the file it was found in. Add this object to the List<FileInformation> corresponding to the key created in step 2.
4. The next time you come across the same word in another file, create a new FileInformation object and add it to the List<FileInformation> corresponding to the entry in the map for the word.
Once you have the Map populated, printing the statistics should be a piece of cake.
for(String word : statistics.keySet()) {
List<FileInformation> fileInfos = statistics.get(word);
for(FileInformation fileInfo : fileInfos) {
//sum up the occurrenceCount for the word to get the total frequency
}
}
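The steps above can be sketched as follows, assuming every useful line has the form docId,word,count as in the frequency files shown in the question. The class name, the merge and totalFrequency helpers and the use of a TreeMap (to keep words in alphabetical order) are illustrative choices, not part of the original answer; reading the 1400 files from disk is left out, so the method takes the lines directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FrequencyMerger {
    static class FileInformation {
        final String fileName;
        final int occurrenceCount;

        FileInformation(String fileName, int occurrenceCount) {
            this.fileName = fileName;
            this.occurrenceCount = occurrenceCount;
        }
    }

    // Groups "docId,word,count" lines by word; lines without three fields are skipped.
    static Map<String, List<FileInformation>> merge(List<String> lines) {
        Map<String, List<FileInformation>> statistics = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length != 3) {
                continue;
            }
            int count = Integer.parseInt(parts[2].trim());
            // Creates the list the first time a word is seen, then appends to it.
            statistics.computeIfAbsent(parts[1].trim(), k -> new ArrayList<>())
                      .add(new FileInformation(parts[0].trim(), count));
        }
        return statistics;
    }

    // Sums the per-document counts for one word.
    static int totalFrequency(List<FileInformation> infos) {
        int total = 0;
        for (FileInformation info : infos) {
            total += info.occurrenceCount;
        }
        return total;
    }
}
```

The number of documents containing a word is then simply the size of its list, and each list entry supplies one [Doc_id,count] pair for the output line.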

What does the @ sign do?

I have seen the at (@) sign in Groovy files and I don't know if it's a Groovy or Java thing. I have tried to search on Google, Bing, and DuckDuckGo for the mystery at sign, but I haven't found anything. Can anyone please give me a resource to know more about what this operator does?
It's a Java annotation. Read more at that link.
As well as being the sign for an annotation, it's the Groovy Field operator.
In Groovy, calling object.field calls the getField method (if one exists). If you actually want a direct reference to the field itself, you use @, i.e.:
class Test {
String name = 'tim'
String getName() {
"Name: $name"
}
}
def t = new Test()
println t.name // prints "Name: tim"
println t.@name // prints "tim"
'@' marks an annotation in Java/Groovy; look at the demo: Example with code
Java 5 and above supports the use of annotations to include metadata within programs. Groovy 1.1 and above also supports such annotations.
Annotations are used to provide information to tools and libraries.
They allow a declarative style of providing metadata information and allow it to be stored directly in the source code.
Such information would need to otherwise be provided using non-declarative means or using external files.
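To make the annotation case concrete, here is a small self-contained Java sketch. The @Info annotation, its author element and the Report class are invented for illustration; only the java.lang.annotation and reflection calls are standard API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class AnnotationDemo {
    // Hypothetical annotation carrying one piece of metadata.
    @Retention(RetentionPolicy.RUNTIME) // keep it available to reflection at runtime
    @Target(ElementType.TYPE)           // allow it on classes and interfaces only
    @interface Info {
        String author();
    }

    // The '@' here attaches the metadata declaratively, directly in the source.
    @Info(author = "tim")
    static class Report {
    }

    // A tool or library can read the metadata back; returns null if it is absent.
    static String authorOf(Class<?> type) {
        Info info = type.getAnnotation(Info.class);
        return info == null ? null : info.author();
    }

    public static void main(String[] args) {
        System.out.println(authorOf(Report.class));
    }
}
```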
It can also be used to access attributes when parsing XML using Groovy's XmlSlurper:
def xml = '''<results><result index="1"/></results>'''
def results = new XmlSlurper().parseText(xml)
def index = results.result[0].@index.text() // index is "1"
http://groovy.codehaus.org/Reading+XML+using+Groovy's+XmlSlurper

How to write a Ruby-regex pattern in Java (includes recursive named-grouping)?

Well... I have a file containing a TinTin script. I have already managed to grab all actions and substitutions from it and show them properly ordered on a website using Ruby, which helps me keep an overview.
Example TINTIN-script
#substitution {You tell {([a-zA-Z,\-\ ]*)}, %*$}
{<279>[<269> $sysdate[1]<279>, <269>$systime<279> |<219> Tell <279>] <269>to <219>%2<279> : <219>%3}
{4}
#substitution {{([a-zA-Z,\-\ ]*)} tells you, %*$}
{<279>[<269> $sysdate[1]<279>, <269>$systime<279> |<119> Tell <279>] <269>from <119>%2<279> : <119>%3}
{2}
#action {Your muscles suddenly relax, and your nimbleness is gone.}
{
#if {$sw_keepaon}
{
aon;
};
} {5}
#action {xxxxx}
{
#if {$sw_keepfamiliar}
{
familiar $familiar;
};
} {5}
To grab them in my Ruby app, I read my script file into a variable 'input' and then use the following pattern to scan the 'input':
pattern = /(?<braces>{([^{}]|\g<braces>)*}){0}^#(?<type>action|substitution)\s*(?<b1>\g<braces>)\s*(?<b2>\g<braces>)\s*(?<b3>\g<braces>)/im
input = ""
File.open("/home/igambin/lmud/lmud.tt") { |file| input = file.read }
input.scan(pattern) { |prio, type, pattern, code|
## here i usually create objects, but for simplicity only output now
puts "Type : #{type}"
puts "Pattern : #{pattern}"
puts "Priority: #{prio}"
puts "Code :\n#{code}"
puts
}
Now my idea was to use the NetBeans platform to write a module that not only keeps an overview but also assists in editing the TinTin script file. So, after opening the file in an editor window, I still need to parse the TinTin file and have all 'actions' and 'substitutions' from the file grabbed and displayed in an eTable, in which I could double-click on an item to open a modification window.
I've set up the module and got everything ready so far; I just can't figure out how to translate the Ruby regex pattern I've written into a working Java regex pattern. It seems named group capturing, and especially the recursive application of these groups, is not supported in Java. Without that I seem to be unable to find a working solution...
Here's the ruby pattern again...
pattern = /(?<braces>{([^{}]|\g<braces>)*}){0}^#(?<type>action|substitution)\s*(?<b1>\g<braces>)\s*(?<b2>\g<braces>)\s*(?<b3>\g<braces>)/im
Can anyone help me to create a java pattern that matches the same?
Many thanks in advance for tips/hints/ideas, and especially for solutions or close-to-solution comments!
Your text format seems pretty simple; it's possible you don't really need recursive matching. This Java-compatible regex matches your sample data correctly, as far as I can tell:
(?s)#(substitution|action)\s*\{(.*?)\}\s*\{(.*?)\}\s*\{(\d+)\}
Would that work for you? If you run Java 7, you can even name the groups. ;)
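A minimal sketch of how that regex might be used from Java with named groups (available from Java 7). The class name, the scan method and the group names type, pattern, code and prio are illustrative choices. The approach is not recursive: it relies on every entry ending in a {digits} priority block, as in the sample data, and would need rework for scripts that break that assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TintinScanner {
    // Same regex as above, with the four capture groups given names.
    static final Pattern ENTRY = Pattern.compile(
        "(?s)#(?<type>substitution|action)\\s*"
        + "\\{(?<pattern>.*?)\\}\\s*"
        + "\\{(?<code>.*?)\\}\\s*"
        + "\\{(?<prio>\\d+)\\}");

    // Returns one {type, pattern, code, priority} array per matched entry.
    static List<String[]> scan(String input) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(input);
        while (m.find()) {
            entries.add(new String[] {
                m.group("type"), m.group("pattern"), m.group("code"), m.group("prio")
            });
        }
        return entries;
    }
}
```

The lazy (.*?) groups expand until the following literal parts of the regex can match, which is what lets the nested braces inside the pattern and code blocks pass through without recursion.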
Can anyone help me to create a java pattern that matches the same?
No, no one can: Java's regex engine does not support recursive patterns (as Ruby 1.9 does).
