I am working on a database self project. I have an input file got from: http://ir.dcs.gla.ac.uk/resources/test_collections/cran/
After processing into 1400 separate file, each named 00001.txt,... 01400.txt...) and after applying Stemming on them, I will store them separately in a specific folder lets call it StemmedFolder with the following format:
in StemmedFolder: 00001.txt includes:
investig
aerodynam
wing
slipstream
brenckman
experiment
investig
aerodynam
wing
in StemmedFolder: 00756.txt includes:
remark
eddi
viscos
compress
mix
flow
lu
ting
And so on....
I wrote the codes that do:
get the StemmedFolder, Count the Unique words
Sort Alphabetically
Add the ID of the document
save each to a new file 00001.txt to 01400.txt as will be described
{I can provide my codes for these 4 sections in case somebody needs to see how is the implementation or change or any edit}
output of each file will be result to a separate file. (1400, each named 00001.txt, 00002.txt...) in a specific folder lets call it FrequenceyFolder with the following format:
in FrequenceyFolder: 00001.txt includes:
00001,aerodynam,2
00001,agre,3
00001,angl,1
00001,attack,7
00001,basi,4
....
in FrequenceyFolder: 00999.txt includes:
00999,aerodynam,5
00999,evalu,1
00999,lift,3
00999,ratio,2
00999,result,9
....
in FrequenceyFolder: 01400.txt includes:
01400,subtract,1
01400,support,1
01400,theoret,1
01400,theori,1
01400,.....
______________
Now my question:
I need to combine these 1400 files again to output a txt file that looks like this format with some calculation:
'aerodynam' totalFrequency=3docs: [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
'book' totalFrequncy=2docs: [[Doc_00562,6],[Doc_01111,1]
....
....
'result' totalFrequency=1doc: [[Doc_00010,5]]
....
....
'zzzz' totalFrequency=1doc: [[Doc_01235,1]]
Thanks for spending time reading this long post
You can use a Map of List.
Map<String,List<FileInformation>> statistics = new HashMap<>()
In the above map, the key will be the word and the value will be a List<FileInformation> object describing the statistics of individual files containing the word. The FileInformation class can be declared as follows :
class FileInformation {
int occurrenceCount;
String fileName;
//getters and setters
}
To populate the above Map, use the following steps :
Read each file in the FrequencyFolder
When you come across a word for the first time, put it as a key in the Map.
Create a FileInformation object and set the occurrenceCount to the number of occurrences found and set the fileName to the name of the file it was found in. Add this object in the List<FileInformation> corresponding to the key created in step 2.
The next time you come across the same word in another file, create a new FileInfomation object and add it to the List<FileInformation> corresponding to the entry in the map for the word.
Once you have the Map populated, printing the statistics should be a piece of cake.
for(String word : statistics.keySet()) {
List<FileInformation> fileInfos = statistics.get(word);
for(FileInformation fileInfo : fileInfos) {
//sum up the occureneceCount for the word to get the total frequency
}
}
Related
I'm new to nextflow/groovy/java and i'm running into some difficulty with a simple regular expression task.
I'm trying to alter the labels of some file pairs.
It is my understanding that fromFilePairs returns a data structure of the form:
[
[common_prefix, [file1, file2]],
[common_prefix, [file3, file4]]
]
I further thought that:
The .name method when invoked on a item from this list will give the name, what I have labelled above as common_prefix
The value returned by a closure used with fromFilePairs sets the names of the file pairs.
The value of it in a closure used with fromFilePairs is a single item from the list of file pairs.
however, I have tried many variants on the following without success:
params.fastq = "$baseDir/data/fastqs/*_{1,2}_*.fq.gz"
Channel
.fromFilePairs(params.fastq, checkIfExists:true) {
file ->
// println file.name // returned the common file prefix as I expected
mt = file.name =~ /(common)_(prefix)/
// println mt
// # java.util.regex.Matcher[pattern=(common)_(prefix) region=0,47 lastmatch=]
// match objects appear empty despite testing with regexs I know to work correctly including simple stuff like (.*) to rule out issues with my regex
// println mt.group(0) // #No match found
mt.group(0) // or a composition like mt.group(0) + "-" + mt.group(1)
}
.view()
I've also tried some variant on this using the replaceAll method.
I've consulted documentation for, nextflow, groovy and java and I still can't figure out what I'm missing. I expect it's some stupid syntactic thing or a misunderstanding of the data structure but I'm tired of banging my head against it when it's probably obvious to someone who knows the language better - I'd appreciate anyone who can enlighten me on how this works.
A closure can be provided to the fromfilepairs operator to implement a custom file pair grouping strategy. It takes a file and should return the grouping key. The example in the docs just groups the files by their file extensions:
Channel
.fromFilePairs('/some/data/*', size: -1) { file -> file.extension }
.view { ext, files -> "Files with the extension $ext are $files" }
This isn't necessary if all you want to do is alter the labels of some file pairs. You can use the map operator for this. The fromFilePairs op emits tuples in which the first element is the 'grouping key' of the matching pair and the second element is the 'list of files' (sorted lexicographically):
Channel
.fromFilePairs(params.fastq, checkIfExists:true) \
.map { group_key, files ->
tuple( group_key.replaceAll(/common_prefix/, ""), files )
} \
.view()
I'd like to read and change zoom level of named destinations in a pdf file using iText 7. Ive come up with the following code:
Map<String, PdfObject> names =
document.getCatalog().getNameTree(PdfName.Dests).getNames();
for (Map.Entry<String, PdfObject> dest : names.entrySet()) {
if (dest.getValue().isArray()) {
PdfArray arr = (PdfArray) dest.getValue();
PdfName asName = arr.getAsName(1); // /Fit
arr.set(1, FitR);
//System.out.println();
arr.setModified();
}
}
However, this code fails to work against my example file and has other flaws as well.
Most importantly, it tries to deal with one type of zoom (/Fit), but other types (/XYZ and so on) should also be handled with. Second, I don't know how to get the page number of named destination as key pair named of destination and its zoom value doesn't seem to have this information. Please see a screenshot of debug session below:
Note, at SO there is already a question dealing with exactly the same topic. The thing is the answer to that question give too little information to deal with this problem.
I am new to OWL 2, and I want to parse a ".ttl" file with OWL API, but I found that OWL API is not same as the API I used before. It seems that I should write a "visitor" if I want to get the content within a OWLAxiom or OWLEntity, and so on. I have read some tutorials, but I didn't get the proper way to do it. Also, I found the tutorials searched were use older version of owl api. So I want a detailed example to parse a instance, and store the content to a Java class.
I have made some attempts, my codes are as follows, but I don't know to go on.
OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
File file = new File("./source.ttl");
OWLOntology localAcademic = manager.loadOntologyFromOntologyDocument(file);
Stream<OWLNamedIndividual> namedIndividualStream = localAcademic.individualsInSignature();
Iterator<OWLNamedIndividual> iterator = namedIndividualStream.iterator();
while (iterator.hasNext()) {
OWLNamedIndividual namedIndividual = iterator.next();
}
Instance for example are as follows. Specially, I want store the "#en" in the object of "ecrm:P3_has_note".
<http://data.doremus.org/performance/4db95574-8497-3f30-ad1e-f6f65ed6c896>
a mus:M42_Performed_Expression_Creation ;
ecrm:P3_has_note "Créée par Teodoro Anzellotti, son commanditaire, en novembre 1995 à Rotterdam"#en ;
ecrm:P4_has_time-span <http://data.doremus.org/performance/4db95574-8497-3f30-ad1e-f6f65ed6c896/time> ;
ecrm:P9_consists_of [ a mus:M28_Individual_Performance ;
ecrm:P14_carried_out_by "Teodoro Anzellotti"
] ;
ecrm:P9_consists_of [ a mus:M28_Individual_Performance ;
ecrm:P14_carried_out_by "à Rotterdam"
] ;
efrbroo:R17_created <http://data.doremus.org/expression/2fdd40f3-f67c-30a0-bb03-f27e69b9f07f> ;
efrbroo:R19_created_a_realisation_of
<http://data.doremus.org/work/907de583-5247-346a-9c19-e184823c9fd6> ;
efrbroo:R25_performed <http://data.doremus.org/expression/b4bb1588-dd83-3915-ab55-b8b70b0131b5> .
The contents I want are as follows:
class Instance{
String subject;
Map<String, Set<Object>> predicateToObject = new HashMap<String,Set<Object>>();
}
class Object{
String value;
String type;
String language = null;
}
The version of owlapi I am using is 5.1.0. I download the jar and the doc from there. I just want to know how to get the content I need in the java class.
If there are some tutorials that describe the way to do it, please tell me.
Thanks a lot.
Update: I have known how to do it, when I finish it, I will write an answer, I hope it can help latecomers of OWLAPI.
Thanks again.
What you need, once you have the individual, is to retrieve the data property assertion axioms and collect the literals asserted for each property.
So, in the for loop in your code:
// Let's rename your Object class to Literal so we don't get confused with java.lang.Object
Instance instance = new Instance();
localAcademic.dataPropertyAssertionAxioms()
.forEach(ax -> instance.predicateToObject.put(
ax.getProperty().getIRI().toString(),
Collections.singleton(new Literal(ax.getObject))));
This code assumes properties only appear once - if your properties appear multiple times, you'll have to check whether a set already exists for the property and just add to it instead of replacing the value in the map. I left that out to simplify the example.
A visitor is not necessary for this scenario, because you already know what axiom type you're interested in and what methods to call on it. It could have been written as an OWLAxiomVisitor implementing only visit(OWLDataPropertyAssertionAxiom) but in this case there would be little advantage in doing so.
I have the following objects:
ITeamRepository repo;
IProjectArea projArea;
ITeamArea teamArea;
The process of obtaining the projArea and the teamArea is quite straightforward (despite the quantity of objects involved). However I can't seem to find a way to obtain a list with all the Workitems associated with these objects in a direct way. Is this directly possible, probably via the IQueryClient objects?
This 2012 thread (so it might have changed since) suggests:
I used the following code to get the work items associated with each project area:
auditableClient = (IAuditableClient) repository.getClientLibrary(IAuditableClient.class);
IQueryClient queryClient = (IQueryClient) repository.getClientLibrary(IQueryClient.class);
IQueryableAttribute attribute = QueryableAttributes.getFactory(IWorkItem.ITEM_TYPE).findAttribute(currProject, IWorkItem.PROJECT_AREA_PROPERTY, auditableClient, null);
Expression expression = new AttributeExpression(attribute, AttributeOperation.EQUALS, currProject);
IQueryResult<IResolvedResult<IWorkItem>> results = queryClient.getResolvedExpressionResults(currProject, expression, IWorkItem.FULL_PROFILE);
In my code, currProject would be the IProjectArea pointer to the current project as you loop through the List of project areas p in your code.
The IQueryResult object 'results' then contains a list of IResolvedResult records with all of the work items for that project you can iterate through and find properties for each work item.
So I am trying to write a little utility in Scala that constantly listens on a bunch of directories for file system changes (deletes, creates, modifications etc) and rsyncs it immediately across to a remote server. (https://github.com/Khalian/LockStep)
My configurations are stored in JSON as the follows:-
{
"localToRemoteDirectories": {
"/workplace/arunavs/third_party": {
"remoteDir": "/remoteworkplace/arunavs/third_party",
"remoteServerAddr": "some Remote server address"
}
}
}
This configuration is stored in a Scala Map (key = localDir, value = (remoteDir, remoteServerAddr)). The tuple is represented as a case class
sealed case class RemoteLocation(remoteDir:String, remoteServerAddr:String)
I am using an actor from a third party:
https://github.com/lloydmeta/schwatcher/blob/master/src/main/scala/com/beachape/filemanagement/FileSystemWatchMessageForwardingActor.scala)
that listens on these directories (e.g. /workplace/arunavs/third_party and then outputs an Java 7 WatchKind event (EVENT_CREATE, EVENT_MODIFY etc). The problem is that the events sent are absolute path (for instance if I create a file helloworld in third_party dir, the message sent by the actor is (ENTRY_CREATE, /workplace/arunavs/third_party/helloworld))
I need a way to write a getter that gets the nearest prefix from the configuration map stored above. The obvious way to do it is to filter on the map:-
def getRootDirsAndRemoteAddrs(localDir:String) : Map[String, RemoteLocation] =
localToRemoteDirectories.filter(e => localDir.startsWith(e._1))
This simply returns the subset of keys that are a prefix to the localDir (in the above example this method is called with localDir = /workplace/arunavs/third_party/helloworld. While this works, this implementation is O(n) where n is the number of items in my configuration. I am looking for better computational complexity (I looked at radix and patricia tries, but they dont cut it since I feeding a string and trying to get keys which are prefixes to it, tries solve the opposite problem).