Error importing BioFormats file into Matlab - java

I'm trying to import .spl files from Slidebook v4.2 into Matlab but I've run into problems.
I downloaded the functions and loci_tools.jar from here. I used them to import one file with minor problems (it swapped the Z planes and the time points, and mislabeled some of the images with the wrong acquisition channel), but I figured out the pattern behind the problems and was able to work around them.
Then I tried to import another file and got an error which I haven't been able to solve. Any ideas would be greatly appreciated; I'm new to working with Java and with Java in MATLAB. Here is the error I get:
I = bfopen('filename.spl');
Finding offsets to pixel data
Determining dimensions
Reading series #1
.Error using loci.formats.ChannelSeparator/openBytes
Java exception occurred:
java.lang.IllegalArgumentException: Negative position
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:600)
at loci.common.NIOByteBufferProvider.allocateDirect(NIOByteBufferProvider.java:133)
at loci.common.NIOByteBufferProvider.allocate(NIOByteBufferProvider.java:118)
at loci.common.NIOFileHandle.buffer(NIOFileHandle.java:532)
at loci.common.NIOFileHandle.seek(NIOFileHandle.java:254)
at loci.common.RandomAccessInputStream.seek(RandomAccessInputStream.java:140)
at loci.formats.in.SlidebookReader.openBytes(SlidebookReader.java:130)
at loci.formats.ImageReader.openBytes(ImageReader.java:414)
at loci.formats.ChannelFiller.openBytes(ChannelFiller.java:197)
at loci.formats.ChannelSeparator.openBytes(ChannelSeparator.java:226)
at loci.formats.ChannelSeparator.openBytes(ChannelSeparator.java:159)
Error in bfGetPlane (line 75)
plane = r.openBytes(iPlane - 1, ip.Results.x - 1, ip.Results.y - 1, ...
Error in bfopen (line 144)
arr = bfGetPlane(r, i, varargin{:});

Please try the latest version of Bio-Formats 5. You can easily use it in Fiji by enabling the Bio-Formats 5 update site, or in MATLAB using the latest loci_tools.jar from Jenkins.
If you still get an error, feel free to report a bug. That said, the recommended approach is to export your data from the Slidebook software to OME-TIFF format.
Unfortunately, though popular, 3i Slidebook is probably the most arcane and difficult format we try to support in Bio-Formats. We have met with the Slidebook developers on multiple occasions to discuss how best to handle the issue. But the SLD format was never intended for public consumption and it continues to evolve with each iteration of the Slidebook software. So the compromise we settled on is for the Slidebook software to support robust export to OME-TIFF format, which preserves rich microscopy-related metadata. From the Bio-Formats 3i Slidebook page:
We strongly encourage users to export their .sld files to OME-TIFF using the SlideBook software. Bio-Formats is not likely to support the full range of metadata that is included in .sld files, and so exporting to OME-TIFF from SlideBook is the best way to ensure that all metadata is preserved.
I know that is not entirely satisfactory, but it is unlikely to change any time soon. Maybe if many customers expressed a strong preference to the Slidebook team to make the SLD format work better in Bio-Formats, they would take some steps to rework the format...
See also: Bio-Formats FAQ: Why do my Slidebook files take a long time to open?
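If you want to check, outside of MATLAB, whether the newer reader can parse the file at all, here is a minimal Java sketch against the Bio-Formats ImageReader API (the same reader bfopen uses under the hood); the file path is a placeholder:
import loci.formats.FormatException;
import loci.formats.ImageReader;
import java.io.IOException;

public class SlidebookCheck {
    public static void main(String[] args) throws FormatException, IOException {
        ImageReader reader = new ImageReader();
        reader.setId("filename.spl");            // placeholder path to the .spl file
        System.out.println("Series: " + reader.getSeriesCount());
        System.out.println("Planes in first series: " + reader.getImageCount());
        byte[] plane = reader.openBytes(0);      // the call that fails in the trace above
        System.out.println("Plane 0 is " + plane.length + " bytes");
        reader.close();
    }
}
If this runs with the latest loci_tools.jar, the problem is on the MATLAB side; if it throws the same exception, it is a reader bug worth reporting.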

Related

Serialization InvalidClass best practice

I am making an application in Java which uses files to store information with serialization. The trouble I ran into is that every time I update one of the classes that is being stored, I obviously get an InvalidClassException. The process I am following for now is to just rm all the files and rebuild them. Obviously that is tedious with 5 users, and I couldn't keep it up with 10. What is the standard best practice for updating serialized objects so as not to lose the information in the files?
Mostly?
Stop using java's baked-in serialization. It sucks. This isn't just an opinion - the OpenJDK engineers themselves routinely raise fairly serious eyebrows when the topic of java's baked-in serialization mechanism (ObjectInputStream / ObjectOutputStream) comes up. In particular:
It is binary.
It is effectively unspecified; you will never be reading or writing it with anything other than java code.
Assuming multiple versions are involved (and they are - that's what your question is all about), it is extremely convoluted and requires advanced java knowledge to spin together some tests to ensure that things are backwards/forwards compatible as desired.
The format is not particularly efficient even though it is binary.
The API is weirdly un-java-like (with structural typing even, that's.. bizarre).
So what should I do?
You use an explicit serializer: a library that you include which does the serialization. There are many options. You can use GSON or Jackson to turn your object into a JSON string and then store that. JSON is textual, fairly easy to read, and can be read and modified by just about any language. Because you 'control' what happens, it's a lot simpler to tweak the format and define what is supposed to happen (e.g. if you add a new field in the code, you can specify what the default should be in your Jackson or GSON annotations, and that's the value you get when you read in a file written with a version of your class that didn't have that field).
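For example, here is a minimal Jackson sketch (the class and field names are purely illustrative): a field added in a newer version of the class simply falls back to its initializer default when an older file that lacks it is read.
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;

public class UserStore {
    public static class User {
        public String name;
        public int loginCount = 0;  // added later: old files without it deserialize to 0
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        User u = new User();
        u.name = "alice";
        mapper.writeValue(new File("user.json"), u);                      // JSON text on disk
        User back = mapper.readValue(new File("user.json"), User.class);  // read it back
        System.out.println(back.name + " / " + back.loginCount);
    }
}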
JSON is not efficient on disk at all, but it's trivial to wrap your writes and reads with GZipOutputStream / GZipInputStream if that's an issue.
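A quick sketch of that wrapping, reusing the hypothetical mapper and User from above (needs java.io.* and java.util.zip.* imports):
try (OutputStream out = new GZIPOutputStream(new FileOutputStream("user.json.gz"))) {
    mapper.writeValue(out, u);                       // compressed JSON on disk
}
try (InputStream in = new GZIPInputStream(new FileInputStream("user.json.gz"))) {
    User back = mapper.readValue(in, User.class);    // transparently decompressed on read
}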
An alternative is protobuf. It is more effort, but you end up with a binary data format that is fairly compact even when not compressed, can still be read and written from many, many languages, and also parses much faster (this is usually irrelevant, since computers are fast and the bottleneck will be network or disk, but if you're reading this stuff on battery-powered Raspberry Pis or the like, it matters).
I really want to stick with java's baked-in serialization
Read the docs, then. The specific part you want here is what serialVersionUID is all about, but there are so many warts and caveats that you should most definitely not just put an svuid in and move on with life - you'll run into the next weird bug in about 5 seconds. Read it all, experiment, and attempt to understand it fully; the sketch below shows the declaration itself.
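(Hypothetical class; the field must be exactly a static final long named serialVersionUID.)
import java.io.Serializable;

public class User implements Serializable {
    // Keep this stable across compatible changes; bump it only when you deliberately
    // break compatibility. With a matching id, a newly added field simply reads back
    // as 0/null from streams written by the older class.
    private static final long serialVersionUID = 1L;

    private String name;
    private int loginCount;   // adding a field like this is a compatible change
}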
Then give up, realize it's a mess and ridiculously complicated to test properly, and use one of the above options.

apache beam pipeline ingesting "Big" input file (more than 1GB) doesn't create any output file

Regarding the dataflow model of computation, I'm doing a PoC to test a few concepts using Apache Beam with the direct-runner (and the Java SDK). I'm having trouble creating a pipeline which reads a "big" CSV file (about 1.25GB) and dumps it into an output file without any particular transformation, like in the following code (I'm mainly concerned with testing IO bottlenecks using this dataflow/beam model, because that is of primary importance to me):
// Example 1: reading from a file and writing to an output file
Pipeline pipeline = Pipeline.create();
PCollection<String> output = pipeline
    .apply(TextIO.read().from("BIG_CSV_FILE"));
output.apply(
    TextIO.write()
        .to("BIG_OUTPUT")
        .withSuffix("csv")
        .withNumShards(1));
pipeline.run();
The problem I'm having is that only smaller files work; when the big file is used, no output file is generated (but no error/exception is shown either, which makes debugging harder).
I'm aware that on the runners page of the apache-beam project (https://beam.apache.org/documentation/runners/direct/), it is explicitly stated under the memory considerations point:
Local execution is limited by the memory available in your local environment. It is highly recommended that you run your pipeline with
data sets small enough to fit in local memory. You can create a small
in-memory data set using a Create transform, or you can use a Read
transform to work with small local or remote files.
The above suggests I'm having a memory problem (though sadly nothing is explicitly stated on the console, so I'm left wondering). I'm also concerned by the suggestion that the dataset should fit into memory (why isn't it reading the file in parts instead of fitting the whole file/dataset into memory?).
A second consideration I'd like to add to this conversation (in case this is indeed a memory problem): how basic is the implementation of the direct runner? It isn't hard to implement code that reads a big file in chunks and writes to a new file in chunks too, so that at no point does memory usage become a problem (because neither file is completely loaded into memory, only the current "chunk"). Even if the direct-runner is more of a prototyping runner for testing semantics, is it too much to expect it to deal nicely with huge files? Consider that this is a unified model built from the ground up to deal with streaming, where window size is arbitrary and huge data accumulation/aggregation before sinking it is a standard use case.
So, more than a question, I'd deeply appreciate your feedback/comments on any of these points: have you noticed IO constraints using the direct-runner? Am I overlooking some aspect, or is the direct-runner really so naively implemented? Have you verified that with a proper production runner like Flink/Spark/Google Cloud Dataflow this constraint disappears?
I'll eventually test with other runners like the Flink or Spark ones, but it feels underwhelming that the direct-runner (even if intended only for prototyping purposes) is having trouble with this first test, considering that the whole dataflow idea is based around ingesting, processing, grouping, and distributing huge amounts of data under the umbrella of a unified batch/streaming model.
EDIT (to reflect Kenn's feedback):
Kenn, thanks for those valuable points and feedback; they have been of great help in pointing me towards the relevant documentation. Following your suggestion, I found out by profiling the application that the problem is indeed heap-related (which somehow is never shown on the normal console, only in the profiler). Even though the file is "only" 1.25GB in size, internal usage goes beyond 4GB before the heap is dumped, suggesting the direct-runner isn't working "chunk by chunk" but is indeed loading everything into memory (as the docs say).
Regarding your points:
1 - I believe that serialization and shuffling could still be achieved with a "chunk by chunk" implementation. Maybe I had a false expectation of what the direct-runner should be capable of, or I didn't fully grasp its intended reach; for now I'll refrain from doing non-functional tests with the direct-runner.
2 - Regarding sharding: I believe numShards controls the parallelism (and number of output files) at the write stage only (processing before that should still be fully parallel, and only at the time of writing will it use as many workers, and generate as many files, as explicitly requested). Two reasons to believe this: first, the CPU profiler always shows 8 busy "direct-runner-workers" (mirroring the number of logical cores my PC has), regardless of whether I set 1 shard or N shards. The second reason is what I understand from the documentation here (https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/io/WriteFiles.html):
By default, every bundle in the input PCollection will be processed by
a FileBasedSink.WriteOperation, so the number of output will vary
based on runner behavior, though at least 1 output will always be
produced. The exact parallelism of the write stage can be controlled
using withNumShards(int), typically used to control how many files
are produced or to globally limit the number of workers connecting to
an external service. However, this option can often hurt performance:
it adds an additional GroupByKey to the pipeline.
One interesting thing here is that the "additional GroupByKey added to the pipeline" is rather undesired in my use case (I only want the results in 1 file, without any regard for order or grouping), so probably adding an extra "flatten files" step after the N sharded output files are generated is a better approach.
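As a rough sketch of that merge step outside the pipeline (plain NIO; the BIG_OUTPUT prefix and the merged file name are assumptions, since the actual shard names are runner-dependent):
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MergeShards {
    public static void main(String[] args) throws IOException {
        List<Path> shards;
        try (Stream<Path> files = Files.list(Paths.get("."))) {
            shards = files
                .filter(p -> p.getFileName().toString().startsWith("BIG_OUTPUT"))
                .sorted()
                .collect(Collectors.toList());
        }
        try (OutputStream merged = Files.newOutputStream(Paths.get("merged.csv"))) {
            for (Path shard : shards) {
                Files.copy(shard, merged);   // append each shard's bytes to the merged file
            }
        }
    }
}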
3 - your suggestion for profiling was spot on, thanks.
Final Edit: the direct runner is not intended for performance testing, only prototyping and checking the well-formedness of the data. It doesn't have any mechanism for splitting and dividing work by partitions, and handles everything in memory.
There are a few issues or possibilities. I will answer in priority order.
The direct runner is for testing with very small data. It is engineered for maximum quality assurance, with performance not much of a priority. For example:
it randomly shuffles data to make sure you are not depending on ordering that will not exist in production
it serializes and deserializes data after each step, to make sure the data will be transmitted correctly (production runners will avoid serialization as much as possible)
it checks whether you have mutated elements in forbidden ways, which would cause you data loss in production
The data you are describing is not very big, and the DirectRunner can process it eventually in normal circumstances.
You have specified numShards(1) which explicitly eliminates all parallelism. It will cause all of the data to be combined and processed in a single thread, so it will be slower than it could be, even on the DirectRunner. In general, you will want to avoid artificially limiting parallelism.
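Concretely, a sketch of letting the runner choose the sharding instead (same placeholder names as the question's snippet):
output.apply(
    TextIO.write()
        .to("BIG_OUTPUT")
        .withSuffix("csv"));   // omitting withNumShards(1) lets the runner pick the write parallelism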
If there is an out-of-memory error or another error preventing processing, you should see a log message. Otherwise, it will be helpful to look at profiling and CPU utilization to determine whether processing is active.
This question has been indirectly answered by Kenn Knowles above. The direct runner is not intended for performance testing, only prototyping and checking the well-formedness of the data. It doesn't have any mechanism for splitting and dividing work by partitions, and handles every dataset in memory. Performance testing should be carried out using other runners (like the Flink runner); those provide data splitting and the type of infrastructure needed to deal with high IO bottlenecks.
UPDATE: adding to the point addressed by this question, there is a related question here: How to deal with (Apache Beam) high IO bottlenecks?
Whereas the question here revolves around figuring out whether the direct runner can deal with huge datasets (which, as established above, it cannot), the link above points to a discussion of whether production runners (like Flink/Spark/Cloud Dataflow) can deal with huge datasets natively out of the box (the short answer is yes, but please check the link for a deeper discussion).

Identify an english word as a thing or product?

Write a program with the following objective -
be able to identify whether a word/phrase represents a thing/product. For example -
1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <-Be able to identify glove as a thing/product.
2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- be able to identify regulator as a thing.
Doing this tells me that the text is talking about a thing/product. As a contrast, the following text talks about a process instead of a thing/product: "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."
I have millions of such texts; hence, doing it manually is not feasible. So far, using NLTK + Python, I have been able to identify some specific cases which use very similar keywords, but I have not been able to do the same with the kinds mentioned in the examples above. Any help will be appreciated!
What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:
create your own labelling algorithm, create training data, test, eval and finally tag your data
use an existing knowledge base (lexicon) to extract semantic labels for each target word
The first option is a complex research project in itself. Do it if you have the time and resources.
The second option will only give you the labels that are available in the knowledge base, and these might not match your wishes. I would give it a try with python, NLTK and Wordnet (interface already available), you might be able to use synset hypernyms for your problem.
This task is called the named entity recognition (NER) problem.
EDIT: There is no clean definition of NER in the NLP community, so one could say this is not an NER task but an instance of the more general sequence labeling problem. Anyway, there is still no tool that can do this out of the box.
Out of the box, Stanford NLP can only recognize the following types:
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical
(MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION,
SET) entities
so it is not suitable for this task. There are some commercial solutions that can possibly do the job; they can be readily found by googling "product name named entity recognition", and some of them offer free trial plans. I don't know of any free, ready-to-deploy solution.
Of course, you can create your own model by hand-annotating about 1000 or so sentences containing product names and training a classifier such as a Conditional Random Field classifier with some basic features (here is a documentation page that explains how to do that with Stanford NLP). This solution should work reasonably well, though it won't be perfect (no system is perfect, but some solutions are better than others).
EDIT: This is a complex task per se, but not that complex unless you want state-of-the-art results. You can create a reasonably good model in just 2-3 days. Here is an example step-by-step procedure using an open source tool:
Download CRF++ and look at the provided examples; they are in a simple text format.
Annotate your data in a similar manner:
a OTHER
glove PRODUCT
comprising OTHER
...
and so on.
Split your annotated data into two files: train (80%) and dev (20%).
Use the following baseline feature template (paste into the template file):
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
Run:
crf_learn template train.txt model
crf_test -m model dev.txt > result.txt
Look at result.txt; one column will contain your hand-labeled data and the other the machine-predicted labels. You can then compare them, compute accuracy, etc. After that, you can feed new unlabeled data into crf_test and get your labels.
As I said, this won't be perfect, but I will be very surprised if it isn't reasonably good (I actually solved a very similar task not long ago), and it will certainly be better than just using a few keywords/templates.
ENDNOTE: this ignores many things and some best practices in solving such tasks, won't be good for academic research, and is not 100% guaranteed to work, but it is still useful for this and many similar problems as a relatively quick solution.

Is EmpiricalDistributionImpl broken on Commons-Math 3.3?

I have been using EmpiricalDistributionImpl from the Apache Commons-Math library for quite a while now; upgrading from 2.x to 3.3, I am experiencing some problems.
First off, NaNs seem to be causing problems during load() in this version; I am pretty sure they were not problematic before. The real problem is that I am getting negative values from getNextValue() even though all of the values I have loaded are strictly positive. Specifically, my values are positive ratios in the (0, +Inf) range, and if I plot them the distribution is pretty top heavy (i.e. 90-95% of the values end up in the top 3 bins).
FWIW, I have found the following two bug reports but not sure they are entirely related.
https://issues.apache.org/jira/browse/MATH-1132
https://issues.apache.org/jira/browse/MATH-984
They both appear to be fixed and scheduled for 3.4 release, except there is no ETA on the release date.
Suggestions?
MATH-1132 is unrelated, but MATH-984 is likely related to the data range problem you mention. NaNs should be filtered before the data are passed to load(), as there is no meaningful way to handle them (without adding support for a NaNStrategy, which is not currently supported).
Version 3.4 was just released.
Please open a new ticket if you still have range problems and feel free to open a ticket to get NaNs supported via a NaNStrategy.
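A minimal sketch of that NaN filtering, assuming the commons-math3 EmpiricalDistribution class (the 3.x successor to EmpiricalDistributionImpl) and some made-up sample values:
import java.util.Arrays;
import org.apache.commons.math3.random.EmpiricalDistribution;

public class LoadWithoutNaNs {
    public static void main(String[] args) {
        double[] raw = {0.5, Double.NaN, 2.0, 7.5, Double.NaN, 1.2};
        double[] clean = Arrays.stream(raw)
                .filter(d -> !Double.isNaN(d))
                .toArray();                         // drop NaNs before load()
        EmpiricalDistribution dist = new EmpiricalDistribution();
        dist.load(clean);
        System.out.println(dist.getNextValue());    // sample from the fitted distribution
    }
}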

Running clustering algorithms in ELKI

I need to run a k-medoids clustering algorithm by using ELKI programmatically. I have a similarity matrix that I wish to input to the algorithm.
Is there any code snippet available for how to run ELKI algorithms?
I basically need to know how to create Database and Relation objects, create a custom distance function, and read the algorithm output.
Unfortunately the ELKI tutorial (http://elki.dbs.ifi.lmu.de/wiki/Tutorial) focuses on the GUI version and on implementing new algorithms, and trying to write code by looking at the Javadoc is frustrating.
If someone is aware of any easy-to-use library for k-medoids, that's probably a good answer to this question as well.
We do appreciate documentation contributions! (Update: I have turned this post into a new ELKI tutorial entry for now.)
ELKI does advocate not embedding it in other Java applications, for a number of reasons. This is why we recommend using the MiniGUI (or the command line it constructs). Adding custom code is best done e.g. as a custom ResultHandler or just by using the ResultWriter and parsing the resulting text files.
If you really want to embed it in your code (there are a number of situations where it is useful, in particular when you need multiple relations, and want to evaluate different index structures against each other), here is the basic setup for getting a Database and Relation:
// Setup parameters:
ListParameterization params = new ListParameterization();
params.addParameter(FileBasedDatabaseConnection.INPUT_ID, filename);
// Add other parameters for the database here!

// Instantiate the database:
Database db = ClassGenericsUtil.parameterizeOrAbort(
    StaticArrayDatabase.class, params);
// Don't forget this, it will load the actual data...
db.initialize();

Relation<DoubleVector> vectors = db.getRelation(TypeUtil.DOUBLE_VECTOR_FIELD);
Relation<LabelList> labels = db.getRelation(TypeUtil.LABELLIST);
If you want to program more generically, use NumberVector<?>.
Why we do (currently) not recommend using ELKI as a "library":
The API is still changing a lot. We keep adding options, and we cannot (yet) provide a stable API. The command line / MiniGUI / parameterization is much more stable, because of the handling of default values: the parameterization only lists the non-default parameters, so you will only notice a change if one of those parameters changes.
In the code example above, note that I also used this pattern. A change to the parsers, database etc. will likely not affect this program!
Memory usage: data mining is quite memory intensive. If you use the MiniGUI or command line, you get a good cleanup when the task is finished. If you invoke it from Java, chances are really high that you keep a reference somewhere and end up leaking lots of memory. So do not use the above pattern without ensuring that the objects are properly cleaned up when you are done!
By running ELKI from the command line, you get two things for free:
no memory leaks. When the task is finished, the process quits and frees all memory.
no need to run the algorithm twice on the same data: subsequent analysis of the written results does not require rerunning the algorithm.
ELKI is not designed as an embeddable library, for good reasons. ELKI has tons of options and functionality, and this comes at a price, both in runtime (although it can easily outperform R and Weka, for example!), in memory usage, and in particular in code complexity.
ELKI was designed for research in data mining algorithms, not for making them easy to include in arbitrary applications. Instead, if you have a particular problem, you should use ELKI to find out which approach works well, then reimplement that approach in an optimized manner for your problem.
Best ways of using ELKI
Here are some tips and tricks:
Use the MiniGUI to build a command line. Note that the logging window of the "GUI" shows the corresponding command line parameters; running ELKI from the command line is easy to script and can easily be distributed to multiple computers, e.g. via Grid Engine.
#!/bin/bash
for k in $( seq 3 39 ); do
java -jar elki.jar KDDCLIApplication \
-dbc.in whatever \
-algorithm clustering.kmeans.KMedoidsEM \
-kmeans.k $k \
-resulthandler ResultWriter -out.gzip \
-out output/k-$k
done
Use indexes. For many algorithms, index structures can make a huge difference!
(But you need to do some research which indexes can be used for which algorithms!)
Consider using the extension points such as ResultWriter. It may be the easiest for you to hook into this API, then use ResultUtil to select the results that you want to output in your own preferred format or analyze:
List<Clustering<? extends Model>> clusterresults =
ResultUtil.getClusteringResults(result);
To identify objects, use labels and a LabelList relation. The default parser will do this when it sees text alongside the numerical attributes, i.e. a file such as
1.0 2.0 3.0 ObjectLabel1
will make it easy to identify the object by its label!
UPDATE: See ELKI tutorial created out of this post for updates.
ELKI's documentation is pretty sparse (I don't know why they don't include a simple "hello world" program in the examples)
You could try Java-ML. Its documentation is a bit more user-friendly, and it does have k-medoids.
Clustering example with Java-ML: http://java-ml.sourceforge.net/content/clustering-basics
K-medoids API (Javadoc): http://java-ml.sourceforge.net/api/0.1.7/
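For orientation, a minimal k-medoids sketch against Java-ML 0.1.7 on a handful of 2-D points (the KMedoids constructor arguments shown here are an assumption, so check the Javadoc linked above):
import net.sf.javaml.clustering.KMedoids;
import net.sf.javaml.core.Dataset;
import net.sf.javaml.core.DefaultDataset;
import net.sf.javaml.core.DenseInstance;
import net.sf.javaml.distance.EuclideanDistance;

public class KMedoidsExample {
    public static void main(String[] args) {
        Dataset data = new DefaultDataset();
        data.add(new DenseInstance(new double[]{1.0, 1.0}));
        data.add(new DenseInstance(new double[]{1.2, 0.8}));
        data.add(new DenseInstance(new double[]{8.0, 8.0}));
        data.add(new DenseInstance(new double[]{8.3, 7.9}));

        // 2 clusters, at most 100 iterations, Euclidean distance
        KMedoids km = new KMedoids(2, 100, new EuclideanDistance());
        Dataset[] clusters = km.cluster(data);
        for (int i = 0; i < clusters.length; i++) {
            System.out.println("Cluster " + i + ": " + clusters[i]);
        }
    }
}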
