The problem:
I'm using UIMA Ruta (v2.3.1) in one of my projects, but now I'm facing a problem:
The memory exceeds explainable sizes, but it can't be figured out, where this problem is located, except for the class org.apache.uima.ruta.rule.RuleElementMatch, that takes up to 50% of memory usage.
I call the JavaAPI of UIMA Ruta in my project, to set up the analysis engine. When I'm sending a text to analyze with around 400kbyte size to this engine, there are around 700MB memory blocked by this process, but without any chance for the GC to free some space.
Ruta project:
The given Ruta rules are built-up with REGEXP-structures, but theoretical they should reduce the amount of memory usage, because there are UNMARKALL-Statements at specific endpoints.
Is someone facing the same situation of high memory consumption or are there any suggested solutions? Using the low memory profile as an advice of uima itself is not possible, due the response time is already at around 30 seconds. Increasing the max memory of the JVM is not an option.
This is probably not an answer, but here are some comments that maybe help.
As the name says, RutaRuleElementMatch stores the matches of rule elements, which is required wihtin one RuleMatch in order to identify the information for the actions. This information can be forgotten after a RuleMatch, but sometimes it is necessary to store it. Mainly, it is stored if the analysis engine is configured for debugging (parameters debug and debugWithMacthes). Then, all rule matches and rule element matches are remembered in order to create the debug annotations later. If there are many matches, this can take a lot of memory in the current implementation.
The debug config is also used in the Java API, e.g., in Ruta.select() or Ruta.matches(). In a smaller amount, the matches are also remembered for the head rules of block statements.
So, if debugging is activated, it should be deactivated in order to reduce the memory usage.
400KB of text is quite a lot, I think. Ruta brings quite some overhead, which is required, but can also be improved/reduced. Right now, until the implementation is improved, there are some best practices in order to handle large document in ruta, i.e. reduce the memory usage.
In your use case, I would switch to a different seeder which creates only annotations you need, and only where you need them, e.g., do you need SPACE and BREAK? Then, I would refactor the rules. The example rule you mentioned in the comments is extremely inefficient, and produces many RuleElementMatches. I rather recommend to use dictionary lookup where possible, e.g. with TRIE. You can also improve such rule by restricting the match condition. In your example, this could be W or the output of some dictionary lookup.
If profiling shows that a lot of memory is used by RutaRuleElementMatch, then this can be caused by the debug config, or by inefficient rules.
If profiling shows that a lot of memory is used by RutaBasic, then it is caused by the size of the document and therefore by the amount of annotations. Reducing the amount of annotations helps, as less coverage information needs to be stored in the internal lists/arrays. UNMARK and UNMARKALL help also but not to an extend one expects, at least in my use cases. There is also the parameter lowMemoryProfile which reduces the memory usage of RutaBasic but also the runtime performance as you mentioned. However, I suppose that your rules can be optimized a lot so that the parameter would be an option again.
I hope this helps.
DISCLAIMER: I am a developer of UIMA Ruta
Related
Regarding the dataflow model of computation, I'm doing a PoC to test a few concepts using apache beam with the direct-runner (and java sdk). I'm having trouble creating a pipeline which reads a "big" csv file (about 1.25GB) and dumping it into an output file without any particular transformation like in the following code (I'm mainly concerned with testing IO bottlenecks using this dataflow/beam model because that's of primary importance for me):
// Example 1 reading and writing to a file
Pipeline pipeline = Pipeline.create();
PCollection<String> output = ipeline
.apply(TextIO.read().from("BIG_CSV_FILE"));
output.apply(
TextIO
.write()
.to("BIG_OUTPUT")
.withSuffix("csv").withNumShards(1));
pipeline.run();
The problem that I'm having is that only smaller files do work, but when the big file is used, no output file is being generated (but also no error/exception is shown either, which makes debugging harder).
I'm aware that on the runners page of the apache-beam project (https://beam.apache.org/documentation/runners/direct/), it is explicitly stated under the memory considerations point:
Local execution is limited by the memory available in your local environment. It is highly recommended that you run your pipeline with
data sets small enough to fit in local memory. You can create a small
in-memory data set using a Create transform, or you can use a Read
transform to work with small local or remote files.
This above suggests I'm having a memory problem (but sadly isn't being explicitly stated on the console, so I'm just left wondering here). I'm also concerned with their suggestion that the dataset should fit into memory (why isn't it reading from the file in parts instead of fitting the whole file/dataset into memory?)
A 2nd consideration I'd like to also add into this conversation would be (in case this is indeed a memory problem): How basic is the implementation of the direct runner? I mean, it isn't hard to implement a piece of code that reads from a big file in chunks, and also outputs to a new file (also in chunks), so that at no point in time the memory usage becomes a problem (because neither file is completely loaded into memory - only the current "chunk"). Even if the "direct-runner" is more of a prototyping runner to test semantics, would it be too much to expect that it should deal nicely with huge files? - considering that this is a unified model built for the ground up to deal with streaming where window size is arbitrary and huge data accumulation/aggregation before sinking it is a standard use-case.
So more than a question I'd deeply appreciate your feedback/comments regarding any of these points: have you notice IO constraints using the direct-runner? Am I overlooking some aspect or is the direct-runner really so naively implemented? Have you verified that by using a proper production runner like flink/spark/google cloud dataflow, this constraint disapears?
I'll eventually test with other runners like the flink or the spark one, but it feels underwhelming that the direct-runner (even if it is intended only for prototyping purposes) is having trouble with this first test I'm running on - considering the whole dataflow idea is based around ingesting, processing, grouping and distributing huge amounts of data under the umbrella of an unified batch/streaming model.
EDIT (to reflect Kenn's feedback):
Kenn, thanks for those valuable points and feedback, they have been of great help in pointing me towards relevant documentation. By your suggestion I've found out by profiling the application that the problem is indeed a java heap related one (that somehow is never shown on the normal console - and only seen on the profiler). Even though the file is "only" 1.25GB in size, internal usage goes beyond 4GB before dumping the heap, suggesting the direct-runner isn't "working by chunks" but is indeed loading everything in memory (as their doc says).
Regarding your points:
1- I believe that serialization and shuffling can very well still be achieved through a "chunk by chunk" implementation. Maybe I had a false expectation of what the direct-runner should be capable of, or I didn't fully grasp its intended reach, for now I'll refrain of doing non-functional type of tests while using the direct-runner.
2 - Regarding sharding. I believe the NumOfShards controls the parallelism (and amount of output files) at the write stage (processing before that should still be fully parallel, and only at the time of writing, will it use as many workers -and generate as many files- as explicitly provided). Two reasons to believe this are: first, the CPU profiler always show 8 busy "direct-runner-workers" -mirroring the amount of logical cores that my PC has-, independently on if I set 1 shard or N shards. The 2nd reason is what I understand from the documentation here (https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/io/WriteFiles.html) :
By default, every bundle in the input PCollection will be processed by
a FileBasedSink.WriteOperation, so the number of output will vary
based on runner behavior, though at least 1 output will always be
produced. The exact parallelism of the write stage can be controlled
using withNumShards(int), typically used to control how many files
are produced or to globally limit the number of workers connecting to
an external service. However, this option can often hurt performance:
it adds an additional GroupByKey to the pipeline.
One interesting thing here is that "additional GroupByKey added to the pipeline" is kind of undesired in my use case (I only desire results in 1 file, without any regard for order or grouping),
so probbly adding an extra "flatten" files step, after having the N sharded output files generated is a better approach.
3 - your suggestion for profiling was spot on, thanks.
Final Edit the direct runner is not intended for performance testing, only prototyping and well formedness of the data. It doen't have any mechanism of spliting and dividing work by partitions, and handles everything in memory
There are a few issues or possibilities. I will answer in priority order.
The direct runner is for testing with very small data. It is engineered for maximum quality assurance, with performance not much of a priority. For example:
it randomly shuffles data to make sure you are not depending on ordering that will not exist in production
it serializes and deserializes data after each step, to make sure the data will be transmitted correctly (production runners will avoid serialization as much as possible)
it checks whether you have mutated elements in forbidden ways, which would cause you data loss in production
The data you are describing is not very big, and the DirectRunner can process it eventually in normal circumstances.
You have specified numShards(1) which explicitly eliminates all parallelism. It will cause all of the data to be combined and processed in a single thread, so it will be slower than it could be, even on the DirectRunner. In general, you will want to avoid artificially limiting parallelism.
If there is any out of memory error or other error preventing processing, you should see a lot message. Otherwise, it will be helpful to look at profiling and CPU utilization to determine if processing is active.
This question has been indirectly answered by Kenn Knowles above. The direct runner is not intended for performance testing, only prototyping and well formedness of the data. It doen't have any mechanism of spliting and dividing work by partitions, and handles every dataset in memory. Performance testing should be carried on by using other runners (like Flink Runner), - those will provide data splitting and the type of infrastructure needed to deal with high IO bottlenecks.
UPDATE: adding to the point adressed by this question, there is a related question here: How to deal with (Apache Beam) high IO bottlenecks?
Whereas the question here revolves around figuring out if the direct runner can deal with huge datasets (which we already established here that it is not possible); the provided link above points to a discussion of weather production runners (like flink/spark/cloud dataflow) can deal natively out of the box with huge datasets (the short answer is yes, but please check yourself on the link for a deeper discussion).
In mongodb docs the author mentions it's a good idea to shorten property names:
Use shorter field names.
and in an old blog post from how to node (it is offline by now April, 2022 edit)
....oft-reported issue with mongoDB is the
size of the data on the disk... each and every record stores all the field-names
.... This means that it can often be
more space-efficient to have properties such as 't', or 'b' rather
than 'title' or 'body', however for fear of confusion I would avoid
this unless truly required!
I am aware of solutions of how to do it. I am more interested in when is this truly required?
To quote Donald Knuth:
Premature optimization is the root of all evil (or at least most of
it) in programming.
Build your application however seems most sensible, maintainable and logical. Then, if you have performance or storage issues, deal with those that have the greatest impact until either performance is satisfactory or the law of diminishing returns means there's no point in optimising further.
If you are uncertain of the impact of particular design decisions (like long property names), create a prototype to test various hypotheses (like "will shorter property names save much space"). Don't expect the outcome of testing to be conclusive, however it may teach you things you didn't expect to learn.
Keep the priority for meaningful names above the priority for short names unless your own situation and testing provides a specific reason to alter those priorities.
As mentioned in the comments of SERVER-863, if you're using MongoDB 3.0+ with the WiredTiger storage option with snappy compression enabled, long field names become even less of an issue as the compression effectively takes care of the shortening for you.
Bottom line up: So keep it as compact as it still stays meaningful.
I don't think that this is every truly required to be shortened to one letter names. Anyway you should shorten them as much as possible, and you feel comfortable with it. Lets say you have a users name: {FirstName, MiddleName, LastName} you may be good to go with even name:{first, middle, last}. If you feel comfortable you may be fine with name:{f, m,l}.
You should use short names: As it will consume disk space, memory and thus may somewhat slowdown your application(less objects to hold in memory, slower lookup times due to bigger size and longer query time as seeking over data takes longer).
A good schema documentation may tell the developer that t stands for town and not for title. Depending on your stack you may even be able to hide the developer from working with these short cuts through some helper utils to map it.
Finally I would say that there's no guideline to when and how much you should shorten your schema names. It highly depends on your environment and requirements. But you're good to keep it compact if you can supply a good documentation explaining everything and/or offering utils to ease the life of developers and admins. Anyway admins are likely to interact directly with mongodb, so I guess a good documentation shouldn't be missed.
I performed a little benchmark, I uploaded 252 rows of data from an Excel into two collections testShortNames and testLongNames as follows:
Long Names:
{
"_id": ObjectId("6007a81ea42c4818e5408e9c"),
"countryNameMaster": "Andorra",
"countryCapitalNameMaster": "Andorra la Vella",
"areaInSquareKilometers": 468,
"countryPopulationNumber": NumberInt("77006"),
"continentAbbreviationCode": "EU",
"currencyNameMaster": "Euro"
}
Short Names:
{
"_id": ObjectId("6007a81fa42c4818e5408e9d"),
"name": "Andorra",
"capital": "Andorra la Vella",
"area": 468,
"pop": NumberInt("77006"),
"continent": "EU",
"currency": "Euro"
}
I then got the stats for each, saved in disk files, then did a "diff" on the two files:
pprint.pprint(db.command("collstats", dbCollectionNameLongNames))
The image below shows two variables of interest: size and storageSize.
My reading showed that storageSize is the amount of disk space used after compression, and basically size is the uncompressed size. So we see the storageSize is identical. Apparently the Wired Tiger engine compresses fieldnames quite well.
I then ran a program to retrieve all data from each collection, and checked the response time.
Even though it was a sub-second query, the long names consistently took about 7 times longer. It of course will take longer to send the longer names across from the database server to the client program.
-------LongNames-------
Server Start DateTime=2021-01-20 08:44:38
Server End DateTime=2021-01-20 08:44:39
StartTimeMs= 606964546 EndTimeM= 606965328
ElapsedTime MilliSeconds= 782
-------ShortNames-------
Server Start DateTime=2021-01-20 08:44:39
Server End DateTime=2021-01-20 08:44:39
StartTimeMs= 606965328 EndTimeM= 606965421
ElapsedTime MilliSeconds= 93
In Python, I just did the following (I had to actually loop through the items to force the reads, otherwise the query returns only the cursor):
results = dbCollectionLongNames.find(query)
for result in results:
pass
Adding my 2 cents on this..
Long named attributes (or, "AbnormallyLongNameAttributes") can be avoided while designing the data model. In my previous organisation we tested keeping short named attributes strategy, such as, organisation defined 4-5 letter encoded strings, eg:
First Name = FSTNM,
Last Name = LSTNM,
Monthly Profit Loss Percentage = MTPCT,
Year on Year Sales Projection = YOYSP, and so on..)
While we observed an improvement in query performance, largely due to the reduction in size of data being transferred over the network, or (since we used JAVA with MongoDB) the reduction in length of "keys" in MongoDB document/Java Map heap space, the overall improvement in performance was less than 15%.
In my personal opinion, this was a micro-optimzation that came at an additional cost (and a huge headache) of maintaining/designing an additional system of managing Data Attribute Dictionary for each of the data models. This system was required to have an organisation wide transparency while debugging the application/answering to client queries.
If you find yourself in a position where upto 20% increase in the performance with this strategy is lucrative to you, may be it is time to scale up your MongoDB servers/choose some other data modelling/querying strategy, or else to choose a different database altogether.
If using verbose xml, trying to ameliorate that with custom names could be very important. A user comment in the SERVER-863 ticket said in his case; I'm ' storing externally-defined XML objects, with verbose naming: the fieldnames are, perhaps, 70% of the total record size. So fieldname tokenization could be a giant win, both in terms of I/O and memory efficiency.'
Collection with smaller name - InsertCompress
Collection with bigger name - InsertNormal
I Performed this on our mongo sharded cluster and Analysis shows
There is around 10-15% gain in shorter names while saving and seems purely based on network latency. I added bulk insert using multiple threads. So if single inserts it can save more.
My avg data size for InsertCompress is 280B and InsertNormal is 350B and inserted 25 million records. So InsertNormal shows 8.1 GB and InsertCompress shows 6.6 GB. This is data size.
Surprisingly Index data size shows as 2.2 GB for InsertCompress collection and 2 GB for InsertNormal collection
Again the storage size is 2.2 GB for InsertCompress collection while InsertNormal its around 1.6 GB
Overall apart from network latency there is nothing gained for storage, so not worth to put efforts going in this direction to save storage. Only if you have much bigger document and smaller field names saves lot of data you can consider
I have been doing a lot of reading on this idea and I have read many different people asking questions about it. I came across: http://javatroops.blogspot.ca/2012/12/sort-very-large-text-file-on-limited.html and I had some questions about it. (Sorry I am unsure on the exact rules on linking blogs and what not).
I understand the basis of external sorting where I am breaking down the file into temporary files and then performing a merge sort on the number of temporary files. However, this method would stress the I/O wouldn't it because of all the files? I could limit the number of files opened at one time by doing multiple passes of n-way merge, but I wished to find improvements.
So reading that blog post above, I understand the idea behind the Solution 2, less so Solution 3 but I do not understand its implementation in Java. In the middle of both solutions, after the file has been split into temporary files, all the files are read into BufferedReaders (mini question about bufferedreaders and filereaders: is it just that it adds buffering and makes I/O faster but doesn't change usability from a programmer's perspective?) and then mapped. Does this process of mapping not take up memory? I am confused on how inserting into a TreeSet would take up memory but putting into a TreeMap wouldn't. How does Solution 3 work exactly?
However, this method would stress the I/O wouldn't it because of all the files? I could limit the number of files opened at one time by doing multiple passes of n-way merge, but I wished to find improvements.
Using memory is always going to be faster, at least 10 - 1000x faster. This is why we use memory and generally try to ensure we have as much as we need. If you are going to use IO heavily, you have to be careful how you access the data as the more you use the IO, the slower you will be.
is it just that it adds buffering and makes I/O faster but doesn't change usability from a programmer's perspective?)
The main advantage is that it support reading lines. (And it does buffering)
Memory mapped files can result in less garbage and overhead depending on how they are used but you are right that it's not very clear.
Memory mapped files use off heap memory and can be more naturally swapped in and out. The problem is that it will use all available free memory transparently. ie. your heap limit doesn't apply so you can pretend you only have 40 MB of heap, but you can't pretend you don't have as much main memory as you do.
How does Solution 3 work exactly?
I suggest you ask the author what he meant exactly.
I'm filling up the JVM Heap Space.
Changing parameters to give more heap space to the JVM, or changing something in my algorithm in the code not to use so much space are two of the most recommended options.
But, if those two have already been tried and applied, and I still get out of memory exceptions, I'd like to see what the other options are.
I found out about this example of "Using a memory mapped file for a huge matrix" and a library called HugeCollections which are an interesting way to solve my problem. Unluckily, the library hasn't seen an update for over a year, and it's not in any Maven repo - so for me it's not a really reliable one.
My question is, is there any other library doing this, or a good way of achieving it (having collection objects (lists and sets) memory mapped)?
You don't say what sort of collections you're using, or the way that you're using them, so it's hard to give recommendations. However, here are a few things to keep in mind:
Keeping the objects on the Java heap will always be the simplest option, and RAM is relatively cheap.
Blindly moving to memory-mapped data is very likely to give horrendous performance, especially if you're moving around in the file and/or making lots of changes. Hash-based collection types are the worst, as they work by distributing data. Tree-based collection types are generally a better choice, and linear collections can go both ways.
Once you move off-heap, you need a way to translate your objects to/from Java. Object serialization is the easiest, but adds lots of overhead. Binary objects accessed via byte buffers are usually a better choice, but you need to be thread-conscious.
You also have to manage your own garbage collection for off-heap objects. Not a problem if all you're doing is creating/updating, but quickly becomes a pain if you're deleting.
If you have a lot of data, and need to access that data in varied ways, a database is probably your best bet.
Unluckily, the library hasn't seen an update for over a year, and it's not in any Maven repo - so for me it's not a really reliable one I agree and I wrote it. ;)
I suggest you look at https://github.com/peter-lawrey/Java-Chronicle which is higher performance has been used a bit. It really design for List & Queue but you could use it for a Map or Set with additional data structures.
Depending on your requirements, you could write your own library. e.g. for time series data I wrote a different library which is not open source unfortunately but can load tables of 500+ GB pretty cleanly.
it's not in any Maven repo
Neither is this one but would be happy for someone to add it.
Sounds like you're either having trouble with a memory leak, or trying to put too large an Object into memory.
Have you tried making a rough estimate of the amount of memory needed to load your data?
Assuming you have no memory leaks or other issues and really need that much storage that you can't fit it in the heap (which I find unlikely) you have basically only one option:
Don't put your data on the heap. Simple as that. Now which method you use to move your data out is very dependend on your requirements (what kind of data, frequency of updates and how much is it really?).
Note: You can use very large heaps with a 64-bit VM and if necessary enlarge the swap space of the OS. It may be the simplest solution to just brutally increase the maximum heap size (even if it means lots of swapping). I certainly would try that first in the situation you outlined.
When writing a Java program, do I have influence on how the CPU will utilize its cache to store my data? For example, if I have an array that is accessed a lot, does it help if it's small enough to fit in one cache line (typically 128 byte on a 64-bit machine)? What if I keep a much used object within that limit, can I expect the memory used by it's members to be close together and staying in cache?
Background: I'm building a compressed digital tree, that's heavily inspired by the Judy arrays, which are in C. While I'm mostly after its node compression techniques, Judy has CPU cache optimization as a central design goal and the node types as well as the heuristics for switching between them are heavily influenced by that. I was wondering if I have any chance of getting those benefits, too?
Edit: The general advice of the answers so far is, don't try to microoptimize machine-level details when you're so far away from the machine as you're in Java. I totally agree, so felt I had to add some (hopefully) clarifying comments, to better explain why I think the question still makes sense. These are below:
There are some things that are just generally easier for computers to handle because of the way they are built. I have seen Java code run noticeably faster on compressed data (from memory), even though the decompression had to use additional CPU cycles. If the data were stored on disk, it's obvious why that is so, but of course in RAM it's the same principle.
Now, computer science has lots to say about what those things are, for example, locality of reference is great in C and I guess it's still great in Java, maybe even more so, if it helps the optimizing runtime to do more clever things. But how you accomplish it might be very different. In C, I might write code that manages larger chunks of memory itself and uses adjacent pointers for related data.
In Java, I can't (and don't want to) know much about how memory is going to be managed by a particular runtime. So I have to take optimizations to a higher level of abstraction, too. My question is basically, how do I do that? For locality of reference, what does "close together" mean at the level of abstraction I'm working on in Java? Same object? Same type? Same array?
In general, I don't think that abstraction layers change the "laws of physics", metaphorically speaking. Doubling your array in size every time you run out of space is a good strategy in Java, too, even though you don't call malloc() anymore.
The key to good performance with Java is to write idiomatic code, rather than trying to outwit the JIT compiler. If you write your code to try to influence it to do things in a certain way at the native instruction level, you are more likely to shoot yourself in the foot.
That isn't to say that common principles like locality of reference don't matter. They do, but I would consider the use of arrays and such to be performance-aware, idiomatic code, but not "tricky."
HotSpot and other optimizing runtimes are extremely clever about how they optimize code for specific processors. (For an example, check out this discussion.) If I were an expert machine language programmer, I'd write machine language, not Java. And if I'm not, it would be unwise to think that I could do a better job of optimizing my code than the experts.
Also, even if you do know the best way to implement something for a particular CPU, the beauty of Java is write-once-run-anywhere. Clever tricks to "optimize" Java code tend to make optimization opportunities harder for the JIT to recognize. Straight-forward code that adheres to common idioms is easier for an optimizer to recognize. So even when you get the best Java code for your testbed, that code might perform horribly on a different architecture, or at best, fail to take advantages of enhancements in future JITs.
If you want good performance, keep it simple. Teams of really smart people are working to make it fast.
If the data you're crunching is primarily or wholly made up of primitives (eg. in numeric problems), I would advise the following.
Allocate a flat structure of fixed size arrays-of-primitives at initialisation-time, and make sure the data therein is periodically compacted/defragmented (0->n where n is the smallest max index possible given your element count), to be iterated over using a for-loop. This is the only way to guarantee contiguous allocation in Java, and compaction further serves to improves locality of reference. Compaction is beneficial, as it reduces the need to iterate over unused elements, reducing the number of conditionals: As the for loop iterates, the termination occurs earlier, and less iteration = less movement through the heap = fewer chances for a cache miss. While compaction creates an overhead in and of itself, this may be done only periodically (with respect to your primary areas of processing) if you so choose.
Even better, you can interleave values in these pre-allocated arrays. For instance, if you are representing spatial transforms for many thousands of entities in 2D space, and are processing the equations of motion for each such, you might have a tight loop like
int axIdx, ayIdx, vxIdx, vyIdx, xIdx, yIdx;
//Acceleration, velocity, and displacement for each
//of x and y totals 6 elements per entity.
for (axIdx = 0; axIdx < array.length; axIdx += 6)
{
ayIdx = axIdx+1;
vxIdx = axIdx+2;
vyIdx = axIdx+3;
xIdx = axIdx+4;
yIdx = axIdx+5;
//velocity1 = velocity0 + acceleration
array[vxIdx] += array[axIdx];
array[vyIdx] += array[ayIdx];
//displacement1 = displacement0 + velocity
array[xIdx] += array[vxIdx];
array[yIdx] += array[vxIdx];
}
This example ignores such issues as rendering of those entities using their associated (x,y)... rendering always requires non-primitives (thus, references/pointers). If you do need such object instances, then you can no longer guarantee locality of reference, and will likely be jumping around all over the heap. So if you can split your code into sections where you have primitive-intensive processing as shown above, then this approach will help you a lot. For games at least, AI, dynamic terrain, and physics can be some of the most processor-intensives aspect, and are all numeric, so this approach can be very beneficial.
If you are down to where an improvement of a few percent makes a difference, use C where you'll get an improvement of 50-100%!
If you think that the ease of use of Java makes it a better language to use, then don't screw it up with questionable optimizations.
The good news is that Java will do a lot of stuff beneath the covers to improve your code at runtime, but it almost certainly won't do the kind of optimizations you're talking about.
If you decide to go with Java, just write your code as clearly as you can, don't take minor optimizations into account at all. (Major ones like using the right collections for the right job, not allocating/freeing objects inside a loop, etc. are still worth while)
So far the advice is pretty strong, in general it's best not to try and outsmart the JIT. But as you say some knowledge about the details is useful sometimes.
Regarding memory layout for objects, Sun's Jvm (now Oracle's) lays objects into memory by type (i.e. doubles and longs first, then ints and floats, then shorts and chars, after that bytes and booleans and finally object references). You can get more details here..
Local variables are usually kept in the stack (that is references and primitive types).
As Nick mentions, the best way to ensure the memory layout in Java is by using primitive arrays. That way you can make sure that data is contiguous in memory. Be careful about array sizes though, GCs have trouble with large arrays. It also has the downside that you have to do some memory management yourself.
On the upside, you can use a Flyweight pattern to get Object-like usability while keeping fast performance.
If you need the extra oomph in performance, generating your own bytecode on the fly helps with some problems, as long as the generated code is executed enough times and your VM's native code cache doesn't get full (which disables the JIT for all practical purposes).
To the best of my knowledge: No. You pretty much have to be writing in machine code to get that level of optimization. With assembly you're a step away because you no longer control where things are stored. With a compiler you're two steps away because you don't even control the details of the generated code. With Java you're three steps away because there's a JVM interpreting your code on the fly.
I don't know of any constructs in Java that let you control things on that level of detail. In theory you could indirectly influence it by how you organize your program and data, but you're so far away that I don't see how you could do it reliably, or even know whether or not it was happening.