I have a large S3 bucket full of photos of 4 different types of animals. My foray into ML will be to see if I can successfully get Deep Learning 4 Java (DL4J) to look at a new, arbitrary photo of one of those 4 species and consistently, correctly guess which animal it is.
My understanding is that I must first perform a "training phase" which effectively builds up an (in-memory) neural network that consists of nodes and weights derived from both this S3 bucket (input data) and my own coding and usage of the DL4J library.
Once trained (meaning, once I have an in-memory neural net built up), my understanding is that I can then enter zero or more "testing phases" where I give a single new image as input, let the program decide what type of animal it thinks the image is of, and then manually mark the output as correct (the program guessed right) or incorrect with corrections (the program guessed wrong, and by the way, such and so was the correct answer). My understanding is that these test phases should help tweak your algorithms and minimize error.
Finally, it is my understanding that the library can then be used in a live "production phase" whereby the program is just responding to images as inputs and making decisions as to what it thinks they are.
All this to ask: is my understanding of ML and DL4J's basic methodology correct, or am I misled in any way?
Training: that works the same way in any framework. You can also persist the neural network, either with the Java-based SerializationUtils or, in the newer release, with the ModelSerializer.
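Roughly, persisting and reloading a trained network with the ModelSerializer could look like the following sketch (assuming you already have a trained MultiLayerNetwork; exact package names can shift between DL4J releases):

import java.io.File;
import java.io.IOException;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.util.ModelSerializer;

public class ModelPersistenceSketch {
    // Save the trained network; saveUpdater = true also keeps the optimizer state,
    // which you want if you plan to continue training later.
    static void save(MultiLayerNetwork net, File file) throws IOException {
        ModelSerializer.writeModel(net, file, true);
    }

    // Restore it later, e.g. for the "production phase" described in the question.
    static MultiLayerNetwork load(File file) throws IOException {
        return ModelSerializer.restoreMultiLayerNetwork(file);
    }
}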
This is more of an integration question than a "can it do x?" one.
DL4J can integrate with Kafka/Spark streaming and do online/mini-batch learning.
The neural nets are embeddable in a production environment.
My only tip here is to ensure that you use the same data pipeline for training as for testing.
This is mainly to ensure consistency between the data you train on and the data you test on.
For mini-batch learning, also make sure you have minibatch(true) (the default) if you are doing mini-batch/online learning, or minibatch(false) if you are training on the whole dataset at once.
I would also suggest using StandardScaler (https://github.com/deeplearning4j/nd4j/blob/master/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/org/nd4j/linalg/dataset/api/iterator/StandardScaler.java) or something similar for persisting global statistics about your data. Much of this will depend on the libraries you use to build your data pipeline, though.
I would assume you would want to normalize your data in some way though.
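To make that concrete, here is a minimal sketch of what a consistent train/test pipeline could look like, assuming a DataSetIterator-based pipeline and the NormalizerStandardize class from ND4J (class and method names may differ slightly between releases):

import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerStandardize;

public class PipelineConsistencySketch {
    // Collect global mean/std statistics from the training data only.
    static NormalizerStandardize fitNormalizer(DataSetIterator trainIter) {
        NormalizerStandardize normalizer = new NormalizerStandardize();
        normalizer.fit(trainIter);
        trainIter.reset();
        return normalizer;
    }

    // Reuse the exact same statistics for test/production data, so training
    // and testing go through the same pipeline.
    static void applyNormalizer(NormalizerStandardize normalizer, DataSetIterator iter) {
        iter.setPreProcessor(normalizer);
    }

    // minibatch(true) is the default; use false only when training on the whole dataset at once.
    static NeuralNetConfiguration.Builder baseConfig(boolean miniBatchLearning) {
        return new NeuralNetConfiguration.Builder().miniBatch(miniBatchLearning);
    }
}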
Regarding the dataflow model of computation: I'm doing a PoC to test a few concepts using Apache Beam with the direct runner (and the Java SDK). I'm having trouble creating a pipeline that reads a "big" CSV file (about 1.25 GB) and dumps it into an output file without any particular transformation, as in the following code (I'm mainly concerned with testing IO bottlenecks using this dataflow/beam model, because that is of primary importance to me):
// Example 1: reading a file and writing it back out
Pipeline pipeline = Pipeline.create();
PCollection<String> output = pipeline
    .apply(TextIO.read().from("BIG_CSV_FILE"));
output.apply(
    TextIO
        .write()
        .to("BIG_OUTPUT")
        .withSuffix("csv")
        .withNumShards(1));
pipeline.run();
The problem I'm having is that only smaller files work; when the big file is used, no output file is generated (and no error/exception is shown either, which makes debugging harder).
I'm aware that on the runners page of the apache-beam project (https://beam.apache.org/documentation/runners/direct/), it is explicitly stated under the memory considerations point:
Local execution is limited by the memory available in your local environment. It is highly recommended that you run your pipeline with
data sets small enough to fit in local memory. You can create a small
in-memory data set using a Create transform, or you can use a Read
transform to work with small local or remote files.
The above suggests I'm having a memory problem (but sadly that isn't stated explicitly on the console, so I'm just left wondering here). I'm also concerned by their suggestion that the dataset should fit into memory (why isn't it reading from the file in parts instead of fitting the whole file/dataset into memory?).
A 2nd consideration I'd like to add to this conversation (in case this is indeed a memory problem): how basic is the implementation of the direct runner? It isn't hard to implement code that reads from a big file in chunks and writes to a new file in chunks too, so that at no point does memory usage become a problem (because neither file is completely loaded into memory, only the current "chunk"). Even if the direct runner is more of a prototyping runner for testing semantics, would it be too much to expect it to deal nicely with huge files? Consider that this is a unified model built from the ground up to deal with streaming, where window size is arbitrary and huge data accumulation/aggregation before sinking it is a standard use case.
So, more than a question, I'd deeply appreciate your feedback/comments regarding any of these points: have you noticed IO constraints using the direct runner? Am I overlooking some aspect, or is the direct runner really so naively implemented? Have you verified that by using a proper production runner like Flink/Spark/Google Cloud Dataflow this constraint disappears?
I'll eventually test with other runners like the Flink or Spark ones, but it feels underwhelming that the direct runner (even if it is intended only for prototyping purposes) has trouble with this first test, considering that the whole dataflow idea is based around ingesting, processing, grouping and distributing huge amounts of data under the umbrella of a unified batch/streaming model.
EDIT (to reflect Kenn's feedback):
Kenn, thanks for those valuable points and feedback; they have been of great help in pointing me towards relevant documentation. Following your suggestion, I found out by profiling the application that the problem is indeed a Java heap-related one (which somehow is never shown on the normal console and is only visible in the profiler). Even though the file is "only" 1.25 GB in size, internal usage goes beyond 4 GB before the heap is dumped, suggesting the direct runner isn't "working by chunks" but is indeed loading everything into memory (as their doc says).
Regarding your points:
1 - I believe that serialization and shuffling can very well still be achieved through a "chunk by chunk" implementation. Maybe I had a false expectation of what the direct runner should be capable of, or I didn't fully grasp its intended reach; for now I'll refrain from doing non-functional tests while using the direct runner.
2 - Regarding sharding: I believe NumOfShards controls the parallelism (and the number of output files) at the write stage (processing before that should still be fully parallel, and only at the time of writing will it use as many workers, and generate as many files, as explicitly provided). Two reasons to believe this: first, the CPU profiler always shows 8 busy "direct-runner-workers" (mirroring the number of logical cores my PC has), independently of whether I set 1 shard or N shards. The 2nd reason is what I understand from the documentation here (https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/io/WriteFiles.html):
By default, every bundle in the input PCollection will be processed by
a FileBasedSink.WriteOperation, so the number of output will vary
based on runner behavior, though at least 1 output will always be
produced. The exact parallelism of the write stage can be controlled
using withNumShards(int), typically used to control how many files
are produced or to globally limit the number of workers connecting to
an external service. However, this option can often hurt performance:
it adds an additional GroupByKey to the pipeline.
One interesting thing here is that the "additional GroupByKey added to the pipeline" is rather undesirable in my use case (I only want the results in 1 file, without any regard for order or grouping), so probably adding an extra "flatten files" step after the N sharded output files are generated is a better approach (see the sketch after these points).
3 - your suggestion for profiling was spot on, thanks.
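For reference, the "flatten files" step mentioned in point 2 could be as simple as the following sketch, which just concatenates the generated shard files with plain java.nio (the shard-file naming pattern is only an assumption for illustration):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FlattenShardsSketch {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("output");              // directory holding the shard files
        Path merged = dir.resolve("BIG_OUTPUT.csv"); // single flattened result

        try (OutputStream out = Files.newOutputStream(merged);
             DirectoryStream<Path> shards = Files.newDirectoryStream(dir, "BIG_OUTPUT-*-of-*.csv")) {
            for (Path shard : shards) {
                // Append each shard verbatim; ordering is irrelevant for this use case.
                Files.copy(shard, out);
            }
        }
    }
}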
Final Edit: the direct runner is not intended for performance testing, only for prototyping and checking the well-formedness of the data. It doesn't have any mechanism for splitting and dividing work by partitions, and handles everything in memory.
There are a few issues or possibilities. I will answer in priority order.
The direct runner is for testing with very small data. It is engineered for maximum quality assurance, with performance not much of a priority. For example:
it randomly shuffles data to make sure you are not depending on ordering that will not exist in production
it serializes and deserializes data after each step, to make sure the data will be transmitted correctly (production runners will avoid serialization as much as possible)
it checks whether you have mutated elements in forbidden ways, which would cause data loss in production
The data you are describing is not very big, and the DirectRunner can process it eventually in normal circumstances.
You have specified numShards(1) which explicitly eliminates all parallelism. It will cause all of the data to be combined and processed in a single thread, so it will be slower than it could be, even on the DirectRunner. In general, you will want to avoid artificially limiting parallelism.
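For example, the same write without the explicit shard limit would look roughly like this (sketch only; the shards can always be merged afterwards if a single file is really required):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

// Same pipeline as in the question, but without withNumShards(1):
// the runner decides the sharding, so the write stage can stay parallel.
Pipeline pipeline = Pipeline.create();
PCollection<String> lines = pipeline.apply(TextIO.read().from("BIG_CSV_FILE"));
lines.apply(TextIO.write().to("BIG_OUTPUT").withSuffix("csv"));
pipeline.run().waitUntilFinish();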
If there is any out-of-memory error or other error preventing processing, you should see a log message. Otherwise, it will be helpful to look at profiling and CPU utilization to determine whether processing is active.
This question has been indirectly answered by Kenn Knowles above. The direct runner is not intended for performance testing, only for prototyping and checking the well-formedness of the data. It doesn't have any mechanism for splitting and dividing work by partitions, and handles every dataset in memory. Performance testing should be carried out using other runners (like the Flink runner); those will provide data splitting and the type of infrastructure needed to deal with high IO bottlenecks.
UPDATE: adding to the point addressed by this question, there is a related question here: How to deal with (Apache Beam) high IO bottlenecks?
Whereas the question here revolves around figuring out whether the direct runner can deal with huge datasets (which we have already established it cannot), the link above points to a discussion of whether production runners (like Flink/Spark/Cloud Dataflow) can deal with huge datasets natively out of the box (the short answer is yes, but please check the link for a deeper discussion).
I am creating an evolution simulator in Java. The simulation consists of a map with cold/hot regions, high/low elevation, etc.
I want the creatures in the world to evolve in two ways: every single creature will evolve its AI during the course of its lifetime, and when a creature reproduces there is a chance of mutation.
I thought it would be good to make the brain of the creatures a neural network that takes the sensor's data as input (only eyes at the moment), and produces commands to the thrusters (which move the creature around).
However, I only have experience with basic neural networks that are given the desired outputs and calculate the error accordingly. In this simulator there is no optimal result. Results can be rated by a fitness function I have created (which takes into account energy changes, number of offspring, etc.), but it is unknown which output node is wrong and which is right.
Am I using the correct approach for this problem? Or perhaps neural networks are not the best solution for it?
If it is a viable way to achieve what I desire, how can I make the neural network adjust the correct weights if I do not know them?
Thanks in advance, and sorry for any english mistakes.
You ran into a common problem with neural networks and games. As mentioned in the comments a Genetic algorithm is often used when there is no 'correct' solution.
So your goal is basically to somehow combine neural networks and genetic algorithms. Luckily somebody did this before and described the process in this paper.
Since the paper is relatively complex and it is very time-consuming to implement the algorithm, you should consider using a library.
Since I couldn't find any suitable library, I decided to write my own; you can find it here.
The library should work well enough for 'smaller' problems like yours. You will find some example code in the Main class.
Combine networks using
Network.breedWith(Network other);
Create networks using
Network net = new Network(int inputs, int outputs);
Mutate networks using
Network.innovate();
As you will see in the example code it is important to always have an initial amount of mutations for each new network. This is because when you create a new network there are no connections, so innovation (fancy word for mutation) is needed to create connections.
If needed you can always create copies of networks (Network.getCopy()). The Network class and all of its attributes implement Serializable, so you can save/load a network using an ObjectOutputStream.
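Putting those pieces together, a rough sketch of the reproduction step plus saving the result could look like this (the Network API is the one listed above; I'm assuming breedWith returns the offspring, and the mutation counts/rates are just placeholders):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class EvolutionSketch {
    public static void main(String[] args) throws IOException {
        // Create networks with e.g. 4 sensor inputs and 2 thruster outputs.
        Network parentA = new Network(4, 2);
        Network parentB = new Network(4, 2);

        // New networks start without connections, so apply some initial mutations.
        for (int i = 0; i < 10; i++) {
            parentA.innovate();
            parentB.innovate();
        }

        // On reproduction, combine the parents and occasionally mutate the child.
        Network child = parentA.breedWith(parentB);
        if (Math.random() < 0.1) {
            child.innovate();
        }

        // The Network class is Serializable, so it can be persisted directly.
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("creature-brain.ser"))) {
            out.writeObject(child);
        }
    }
}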
If you decide to use my library please let me know what results you got!
Write a program with the following objective -
be able to identify whether a word/phrase represents a thing/product. For example -
1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <-Be able to identify glove as a thing/product.
2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- be able to identify regulator as a thing.
Doing this tells me that the text is talking about a thing/product. As a contrast, the following text talks about a process instead of a thing/product: "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."
I have millions of such texts; hence, manually doing it is not feasible. So far, with the help of using NLTK + Python, I have been able to identify some specific cases which use very similar keywords. But I have not been able to do the same with the kinds mentioned in the examples above. Any help will be appreciated!
What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:
create your own labelling algorithm, create training data, test, eval and finally tag your data
use an existing knowledge base (lexicon) to extract semantic labels for each target word
The first option is a complex research project in itself. Do it if you have the time and resources.
The second option will only give you the labels that are available in the knowledge base, and these might not match your wishes. I would give it a try with Python, NLTK and WordNet (an interface is already available); you might be able to use synset hypernyms for your problem.
This task is called the named entity recognition (NER) problem.
EDIT: There is no clean definition of NER in the NLP community, so one could say this is not an NER task but an instance of the more general sequence labeling problem. Anyway, there is still no tool that can do this out of the box.
Out of the box, Stanford NLP can only recognize the following types:
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical
(MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION,
SET) entities
so it is not suitable for solving this task. There are some commercial solutions that can possibly do the job; they can be readily found by googling "product name named entity recognition", and some of them offer free trial plans. I don't know of any free, ready-to-deploy solution.
Of course, you can create your own model by hand-annotating about 1000 or so sentences containing product names and training some classifier like a Conditional Random Field classifier with some basic features (here is the documentation page that explains how to do that with Stanford NLP). This solution should work reasonably well, though it won't be perfect of course (no system will be perfect, but some solutions are better than others).
EDIT: This is a complex task per se, but not that complex unless you want state-of-the-art results. You can create a reasonably good model in just 2-3 days. Here is an example of step-by-step instructions using an open source tool:
Download CRF++ and look at the provided examples; they are in a simple text format.
Annotate your data in a similar manner:
a OTHER
glove PRODUCT
comprising OTHER
...
and so on.
Split your annotated data into two files: train (80%) and dev (20%).
Use the following baseline template features (paste into the template file):
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
Run:
crf_learn template train.txt model
crf_test -m model dev.txt > result.txt
Look at result.txt. One column will contain your hand-labeled tags and another the machine-predicted labels. You can then compare these, compute accuracy, etc. After that you can feed new unlabeled data into crf_test and get your labels.
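If it helps, a tiny sketch of that comparison step (assuming whitespace-separated columns with the hand-labeled tag second to last and the predicted tag last, which is the usual crf_test layout):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class CrfAccuracySketch {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("result.txt"));
        int total = 0, correct = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;      // blank lines separate sentences
            String[] cols = line.trim().split("\\s+");
            String gold = cols[cols.length - 2];      // hand-labeled tag
            String predicted = cols[cols.length - 1]; // tag predicted by crf_test
            total++;
            if (gold.equals(predicted)) correct++;
        }
        System.out.printf("token accuracy: %.2f%% (%d/%d)%n",
                100.0 * correct / total, correct, total);
    }
}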
As I said, this won't be perfect, but I will be very surprised if it isn't reasonably good (I actually solved a very similar task not long ago), and it is certainly better than just using a few keywords/templates.
ENDNOTE: this ignores many things and some best practices in solving such tasks, won't be good enough for academic research, and isn't 100% guaranteed to work, but it is still useful for this and many similar problems as a relatively quick solution.
One of the projects I'm currently working on includes a Java Swing application for field users to input data about equipment parts scattered all over the company's facilities.
Most of the data collected is measured (field validation is required) or is some value the operator can choose from a predefined list. The software then does some calculations and displays instructions back to the operator.
So, for example, for part number 3435Af-B, engineering requires the operator to measure the diameter of some bolt and choose the part maker from a list. The application then compares the measured diameter to the stock diameter and tells the operator if the bolt should be replaced (this is obviously not a real-world example, but you get the idea).
The problem is there are over 200 known equipment parts, and these are pretty old and heterogeneous, so the engineering team has only a limited idea of what they will want to measure on each part in the future. It doesn't seem reasonable to have the development team write and maintain over 200 different classes, but it also doesn't seem realistic to have the engineers use a complicated system of form builders and BPM tools like Drools (I'm getting sweaty just thinking about it).
We're currently trying to make a poor man's solution that would allow the engineers to build forms with a simple GUI that has limited features and outputs the forms to XML files. The few complicated cases not covered by the GUI solution would be custom-made by the developers (very few cases). For the calculation part, we would use the Java Expressions Library (JEL); the expressions to evaluate would be generated by the same GUI.
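As an illustration only, here is a rough sketch of what such an engineer-authored form definition and the generic code consuming it could look like (the XML schema, attribute names and the hard-coded range check are all made up; in the real system the check would be a JEL expression generated by the same GUI):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class FormDefinitionSketch {
    // A tiny engineer-authored form definition; schema and names are illustrative only.
    static final String FORM_XML =
        "<form part='3435Af-B'>" +
        "  <field id='boltDiameter' type='measurement' unit='mm' min='11.8' max='12.2'" +
        "         instruction='Replace the bolt if the measured diameter is out of range.'/>" +
        "</form>";

    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(FORM_XML.getBytes(StandardCharsets.UTF_8)));
        Element field = (Element) doc.getElementsByTagName("field").item(0);

        double measured = 12.4; // value the operator would type into the generated Swing form
        double min = Double.parseDouble(field.getAttribute("min"));
        double max = Double.parseDouble(field.getAttribute("max"));

        // In the real system this comparison would be an expression evaluated by JEL,
        // generated by the same GUI that produced the XML.
        if (measured < min || measured > max) {
            System.out.println(field.getAttribute("instruction"));
        } else {
            System.out.println("Field " + field.getAttribute("id") + " is within tolerance.");
        }
    }
}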
Before we commit to this solution, I was wondering if there is something we could do differently to have a more robust system. I know that some people will consider this too much soft-coding, and I agree that it is going to be difficult to extract and process the data.
I'm working on a project where I have to identify similar patterns between wave files whose frequencies differ.
For example, human voice frequencies differ from person to person. If I have to identify whether a voice is crying, shouting or laughing, there should be a pattern common to crying voices regardless of the frequency.
So I'm looking for an algorithm that can identify these elements.
You could start by taking a look at Neural Networks. These types of programs usually deal well with certain inconsistencies in your data. The Neuroph Studio provides you with a quick and relatively easy way to construct your Neural Network.
All you need is a set of data containing whatever you want to match. You can use about 70% of this data to let your Neural Network learn to cluster your data and then, use the remaining 30% to test your Neural Network.
The main issue with Neural Networks is that you need to find a way to encode your data into input vectors. Once you do that, the Neural Network should try and learn to find the differences on its own.
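One possible encoding, sketched with the standard javax.sound.sampled API: read a 16-bit mono PCM WAV file and turn it into a vector of per-frame energies (the frame size and the energy feature are arbitrary choices for illustration; a spectral representation such as FFT or MFCC features would usually work better):

import java.io.File;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

public class WavFeatureSketch {
    // Reads a 16-bit little-endian mono PCM WAV file and returns per-frame energies.
    static double[] frameEnergies(File wavFile, int frameSize) throws Exception {
        try (AudioInputStream in = AudioSystem.getAudioInputStream(wavFile)) {
            byte[] bytes = in.readAllBytes();
            int samples = bytes.length / 2;
            double[] energies = new double[samples / frameSize];
            for (int f = 0; f < energies.length; f++) {
                double sum = 0;
                for (int i = 0; i < frameSize; i++) {
                    int idx = (f * frameSize + i) * 2;
                    // Assemble a signed 16-bit sample from two little-endian bytes.
                    int sample = (bytes[idx + 1] << 8) | (bytes[idx] & 0xff);
                    sum += (double) sample * sample;
                }
                energies[f] = sum / frameSize;
            }
            return energies;
        }
    }
}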
For image-based recognition, Principal Component Analysis and its siblings, like Kernel PCA or Linear Discriminant Analysis, are the right thing. PCA is an algorithm that works on any kind of data, so I think it should also work on sound.
I would convert the WAV into int vectors and run a PCA on it to extract the features.
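For example (just one option; the JMathTools mentioned below would work similarly), a rough sketch of that idea with Apache Commons Math, assuming each wave file has already been converted to a fixed-length vector and the vectors are stacked as the rows of data:

import java.util.Arrays;
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.stat.correlation.Covariance;

public class PcaSketch {
    // Returns the top-k principal components (as rows) of the data matrix,
    // where each row of 'data' is one sample, e.g. one converted wave file.
    static RealMatrix principalComponents(double[][] data, int k) {
        RealMatrix m = MatrixUtils.createRealMatrix(data);

        // Covariance subtracts the column means itself, so no manual centering is needed here.
        RealMatrix cov = new Covariance(m).getCovarianceMatrix();
        EigenDecomposition eig = new EigenDecomposition(cov);

        // Pick the k eigenvectors with the largest eigenvalues.
        double[] values = eig.getRealEigenvalues();
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < values.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(values[b], values[a]));

        double[][] components = new double[k][];
        for (int i = 0; i < k; i++) {
            components[i] = eig.getEigenvector(order[i]).toArray();
        }
        return MatrixUtils.createRealMatrix(components);
    }
}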
JMathTools is very good for that...
I also found this...
Hope I could help you.