Could someone list the major tasks that the bytecode verifier has to perform to guarantee correctness of the program? Is there a standard, minimal set of responsibilities defined in the JVM specification? I was also wondering whether verification spans other phases, such as loading and initializing.
This is specified in the JVM Specification: Chapter 4.10, "Verification of class Files".
The bulk of the page describes the various aspects of type safety. To check that the program is type-safe the verifier needs to figure out what types of operands reside in the operand stack at each program point, and make sure that they match the type expected by the respective instruction.
Other things it verifies include, but are not limited to, the following:
Branches must be within the bounds of the code array for the method.
The targets of all control-flow instructions are each the start of an instruction. In the case of a wide instruction, the wide opcode is considered the start of the instruction, and the opcode giving the operation modified by that wide instruction is not considered to start an instruction. Branches into the middle of an instruction are disallowed.
No instruction can access or modify a local variable at an index greater than or equal to the number of local variables that its method indicates it allocates.
All references to the constant pool must be to an entry of the appropriate type. (For example, the instruction getfield must reference a field.)
The code does not end in the middle of an instruction.
Execution cannot fall off the end of the code.
For each exception handler, the starting and ending point of code protected by the handler must be at the beginning of an instruction or, in the case of the ending point, immediately past the end of the code. The starting point must be before the ending point. The exception handler code must start at a valid instruction, and it must not start at an opcode being modified by the wide instruction.
As a final step, the verifier also performs a data-flow analysis, which makes sure that no instruction references any uninitialized local variables.
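For example, for a hypothetical method static int f(int n), a two-instruction body like the following would be rejected, because register 1 is neither a parameter nor stored to before it is loaded (a sketch in the same style as the bytecode listings further below):

0: iload_1   // load register 1 -- never written to, so the verifier rejects this
1: ireturn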
Alternatively, you might like to take a look at the Java Language Environment white paper by James Gosling.
The bytecode verifier traverses the bytecodes, constructs the type
state information, and verifies the types of the parameters to all the
bytecode instructions.
The illustration shows the flow of data and control from Java language
source code through the Java compiler, to the class loader and
bytecode verifier and hence on to the Java virtual machine, which
contains the interpreter and runtime system. The important issue is
that the Java class loader and the bytecode verifier make no
assumptions about the primary source of the bytecode stream--the code
may have come from the local system, or it may have travelled halfway
around the planet. The bytecode verifier acts as a sort of gatekeeper:
it ensures that code passed to the Java interpreter is in a fit state
to be executed and can run without fear of breaking the Java
interpreter. Imported code is not allowed to execute by any means
until after it has passed the verifier's tests. Once the verifier is
done, a number of important properties are known:
There are no operand stack overflows or underflows
The types of the parameters of all bytecode instructions are known to always be correct
Object field accesses are known to be legal--private, public, or protected
While all this checking appears excruciatingly detailed, by the time
the bytecode verifier has done its work, the Java interpreter can
proceed, knowing that the code will run securely. Knowing these
properties makes the Java interpreter much faster, because it doesn't
have to check anything. There are no operand type checks and no stack
overflow checks. The interpreter can thus function at full speed
without compromising reliability.
Reference:
http://java.sun.com/docs/white/langenv/Security.doc3.html
Computer resources being RAM, processing power, and disk space. I am just curious, even though it is more or less by a tiny itty-bitty amount.
It could, in theory, be a hair faster in some cases. In practice, they're equally fast.
Non-static, non-private methods are invoked using the invokevirtual bytecode op. This opcode requires the JVM to dynamically look up the actual method to invoke: if you have a call that's statically compiled to AbstractList::contains, should that resolve to ArrayList::contains, or LinkedList::contains, etc.? What's more, the compiler can't just reuse the result of this resolution for next time; what if the next time that myList.contains(val) gets called, it's on a different implementation? So, the compiler has to do at least some amount of checking, roughly per-invocation, for non-private methods.
Private methods can't be overridden, and they're invoked using invokespecial. This opcode is used for various kinds of method calls that you can resolve just once and then never change: constructors, calls to super methods, etc. For instance, if I'm in ArrayList::add and I call super.add(value) (which doesn't happen there, but let's pretend it did), then the compiler can know for sure that this refers to AbstractList::add, since a class's superclass can't ever change.
So, in very rough terms, an invokevirtual call requires resolving the method and then invoking it, while an invokespecial call doesn't require resolving the method (after the first time it's called -- you have to resolve everything at least once!).
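As a rough illustration (a sketch only; the exact opcodes can vary by javac version -- for example, nest-based access control in Java 11 changed how some private calls compile), a class like the hypothetical Calls below has traditionally compiled its two calls differently, which javap -c makes visible:

public class Calls {
    private int secret()  { return 42; }   // cannot be overridden
    public  int visible() { return 42; }   // subject to virtual dispatch

    public int run() {
        // javap -c has traditionally shown something like:
        //   invokespecial #2  // Method secret:()I  -- resolved once, no per-call lookup
        //   invokevirtual #3  // Method visible:()I -- the receiver's runtime class decides
        return secret() + visible();
    }
}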
This is covered in the JVM spec, section 5.4.3:
Resolution of the symbolic reference of one occurrence of an invokedynamic instruction does not imply that the same symbolic reference is considered resolved for any other invokedynamic instruction.
For all other instructions above, resolution of the symbolic reference of one occurrence of an instruction does imply that the same symbolic reference is considered resolved for any other non-invokedynamic instruction.
(emphasis in original)
Okay, now for the "but you won't notice the difference" part. The JVM is heavily optimized for virtual calls. It can do things like detect that a certain site always sees an ArrayList specifically, and so "staticify" the List::add call to actually be ArrayList::add. To do this, it needs to verify that the incoming object really is the expected ArrayList, but that's very cheap; and if some earlier method call has already done that work in this method, it doesn't need to happen again. This is called a monomorphic call site: even though the code is technically polymorphic, in practice the list only has one form.
The JVM optimizes monomorphic call sites, and even bimorphic call sites (for instance, the list is always an ArrayList or a LinkedList, never anything else). Once it sees three forms, it has to use a full polymorphic dispatch, which is slower. But then again, at that point you're comparing apples to oranges: a non-private, polymorphic call to a private call that's monomorphic by definition. It's more fair to compare the two kinds of monomorphic calls (virtual and private), and in that case you'll probably find that the difference is minuscule, if it's even detectable.
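To make those shapes concrete, here is a sketch in plain Java (names are illustrative) of a call site that stays monomorphic and one that becomes bimorphic:

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class CallSites {
    static long sum(List<Integer> list) {
        long s = 0;
        for (int i = 0; i < list.size(); i++) {  // size() and get() are virtual calls
            s += list.get(i);
        }
        return s;
    }

    public static void main(String[] args) {
        List<Integer> a = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) a.add(i);

        // If sum() only ever sees ArrayList, its call sites are monomorphic and the
        // JIT can devirtualize (and often inline) them after a cheap type check.
        sum(a);

        // Feeding in a second implementation makes the sites bimorphic; a third
        // distinct receiver class would force the slower polymorphic dispatch.
        sum(new LinkedList<>(a));
    }
}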
I just did a quick JMH benchmark to compare (a) accessing a field directly, (b) accessing it via a public getter and (c) accessing it via a private getter. All three took the same amount of time. Of course, uber-micro benchmarks are very hard to get right, because the JIT can do such wonderful things with optimizations. Then again, that's kind of the point: The JIT does such wonderful things with optimizations that public and private methods are just as fast.
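For reference, such a benchmark looks roughly like this (a minimal sketch, assuming JMH is on the classpath; names are made up). Returning the value from each method lets JMH consume it, which keeps the JIT from dead-code-eliminating the access:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class AccessBench {
    int value = 42;

    private int privateGetter() { return value; }
    public  int publicGetter()  { return value; }

    @Benchmark
    public int direct()     { return value; }           // (a) direct field access

    @Benchmark
    public int viaPublic()  { return publicGetter(); }  // (b) public getter

    @Benchmark
    public int viaPrivate() { return privateGetter(); } // (c) private getter
}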
Do private functions use more or less computer resources than public ones?
No. The JVM uses the same resources regardless of the access modifier on individual fields or methods.
But there is a far better reason to prefer private (or protected) besides resource utilization, namely encapsulation. Also, I highly recommend you read The Developer Insight Series: Part 1 - Write Dumb Code.
I am just curious, even though it is more or less by a tiny itty-bitty amount.
While it is good to be curious ... if you start taking this kind of thing into account when you are programming, then:
you are liable to waste a lot of time looking for micro-optimizations that are not needed,
your code is liable to be unmaintainable because you are sacrificing good design principles, and
you even risk making your code less efficient* than it would be if you didn't optimize.
* - It can go like this: 1) You spend a lot of time tweaking your code to run fast on your test platform. 2) When you run on the production platform, you find that the hardware gives you different performance characteristics. 3) You upgrade the Java installation, and the new JVM's JIT compiler optimizes your code differently, or it has a bunch of new optimizations that are inhibited by your tweaks. 4) When you run your code on real-world workloads, you discover that the assumptions that were the basis for your tweaking are invalid.
I am reading about dynamic dispatch, as I have an exam tomorrow.
In C++ we have conforming subclasses, so through the static type of the identifier we know what index to access in the virtual method table of the runtime object.
From what I am reading, Java has conformance for subclasses as well, but instead of including the known index of a method in the virtual method table in the compiled code, it only includes a symbolic reference to the method, which needs to be resolved.
What is the point of this if the static type does not refer to an interface? It could be much faster to do it the C++ way.
The Java platform defines linkage as a step taken at runtime. Virtual method tables aren't even involved in the JVM specification; they are just a typical way to implement linkage.
Note, however, that after the symbolic reference is resolved into a direct reference, there is nothing stopping the runtime from using very fast code paths for method invocation sites. That includes special-case optimizations such as monomorphic call sites, which have a hardwired direct pointer to the method code and are thus faster than vtable lookups. Monomorphic sites then become an easy target for method inlining, which opens a whole new field of applicable optimizations. Another option is an n-polymorphic site, accommodating up to n different target types in an inline cache.
As opposed to C++, all these optimizing decisions happen at runtime, subject to the specific conditions at work: the exact set of loaded classes, profiling data for each individual call site, etc. This gives managed-runtime platforms such as Java advantages of their own.
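Conceptually (a sketch of the idea only, not actual JVM-generated code), a bimorphic inline cache at a call site like list.add(x) behaves roughly as if the JIT had compiled this:

static void addThroughInlineCache(java.util.List<Integer> list, int x) {
    if (list.getClass() == java.util.ArrayList.class) {
        list.add(x);  // direct, possibly inlined, call to ArrayList.add
    } else if (list.getClass() == java.util.LinkedList.class) {
        list.add(x);  // direct call to LinkedList.add
    } else {
        list.add(x);  // fall back to the general vtable/itable dispatch
    }
}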
So I am a little confused regarding the verification of bytecode that happens inside a JVM. According to the book by Deitel and Deitel, a Java program goes through five phases (edit, compile, load, verify and execute) (chapter 1). The bytecode verifier verifies the bytecode during the 'verify' stage. Nowhere does the book mention that the bytecode verifier is a part of the classloader.
However, according to the Oracle docs, the class loader performs the tasks of loading, linking and initialization, and during the process of linking it has to verify the bytecode. Now, are the bytecode verification that Deitel and Deitel talk about and the bytecode verification that this Oracle document talks about the same process?
Or does bytecode verification happen twice, once during the linking process and the other by the bytecode verifier?
Picture describing the phases of a Java program as mentioned in the book by Deitel and Deitel. (I borrowed this pic from one of the answers below by nobalG :) )
You may understand bytecode verification using this diagram, which is explained in detail in the Oracle docs. You will find that bytecode verification happens only once, not twice.
(This quotes the same Java Language Environment white-paper passage reproduced in full earlier: the class loader and bytecode verifier make no assumptions about the source of the bytecode stream, and code is executed only after it has passed the verifier's tests.)
EDIT:-
From Oracle Docs Section 5.3.2:
When the loadClass method of the class loader L is invoked with the
name N of a class or interface C to be loaded, L must perform one of
the following two operations in order to load C:
The class loader L can create an array of bytes representing C as the bytes of a ClassFile structure (§4.1); it then must invoke the
method defineClass of class ClassLoader. Invoking defineClass
causes the Java Virtual Machine to derive a class or interface
denoted by N using L from the array of bytes using the algorithm
found in §5.3.5.
The class loader L can delegate the loading of C to some other class loader L'. This is accomplished by passing the argument N
directly or indirectly to an invocation of a method on L'
(typically the loadClass method). The result of the invocation is
C.
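In source terms (a minimal sketch with made-up names), the two options correspond roughly to a loader either defining the class itself from a byte array or delegating to another loader:

class MyLoader extends ClassLoader {
    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        // Option 1: derive the class from a ClassFile byte array; format checks
        // happen in defineClass, and verification happens when the class is linked.
        byte[] bytes = loadBytesSomehow(name);  // hypothetical helper
        return defineClass(name, bytes, 0, bytes.length);
    }

    // Option 2 (delegation) is the default behaviour of ClassLoader.loadClass,
    // which asks the parent loader before calling findClass above.

    private byte[] loadBytesSomehow(String name) throws ClassNotFoundException {
        throw new ClassNotFoundException(name);  // placeholder for real loading
    }
}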
As Holger correctly commented, let me try to explain it further with the help of an example:
static int factorial(int n)
{
    int res;
    for (res = 1; n > 0; n--) res = res * n;
    return res;
}
The corresponding bytecode would be:
method static int factorial(int), 2 registers, 2 stack slots
0: iconst_1 // push the integer constant 1
1: istore_1 // store it in register 1 (the res variable)
2: iload_0 // push register 0 (the n parameter)
3: ifle 14 // if negative or zero, go to PC 14
6: iload_1 // push register 1 (res)
7: iload_0 // push register 0 (n)
8: imul // multiply the two integers at top of stack
9: istore_1 // pop result and store it in register 1
10: iinc 0, -1 // decrement register 0 (n) by 1
11: goto 2 // go to PC 2
14: iload_1 // load register 1 (res)
15: ireturn // return its value to caller
Note that most of the instructions in the JVM are typed.
Now you should note that proper operation of the JVM is not guaranteed unless the code meets at least the following conditions:
Type correctness: the arguments of an instruction are always of the
types expected by the instruction.
No stack overflow or underflow: an instruction never pops an argument
off an empty stack, nor pushes a result on a full stack (whose size is
equal to the maximal stack size declared for the method).
Code containment: the program counter must always point within the
code for the method, to the beginning of a valid instruction encoding
(no falling off the end of the method code; no branches into the
middle of an instruction encoding).
Register initialization: a load from a register must always follow at
least one store in this register; in other terms, registers that do
not correspond to method parameters are not initialized on method
entrance, and it is an error to load from an uninitialized register.
Object initialization: when an instance of a class C is created, one
of the initialization methods for class C (corresponding to the
constructors for this class) must be invoked before the class
instance can be used.
The purpose of bytecode verification is to check these conditions once and for all, by static analysis of the bytecode at load time. Bytecode that passes verification can then be executed faster. Note also that the point of bytecode verification is to shift the checks listed above from run time to load time.
The above explanation has been taken from Java bytecode verification: algorithms and formalizations
No.
From the JVM Spec 4.10:
Even though a compiler for the Java programming language must only produce class files that satisfy all the static and structural constraints in the previous sections, the Java Virtual Machine has no guarantee that any file it is asked to load was generated by that compiler or is properly formed.
And then it proceeds to specify the verification process.
And JVM Spec 5.4.1:
Verification (§4.10) ensures that the binary representation of a class or interface is structurally correct (§4.9). Verification may cause additional classes and interfaces to be loaded (§5.3) but need not cause them to be verified or prepared.
The section specifying linking references §4.10, not as a separate process but as part of loading the classes.
The JVMS and JLS are great documents to consult when you have a question like this.
No such two-time verification.
No. As far as verification is concerned, look closely at how a program written in Java goes through the various phases in the following image. You will see that there is no two-time verification; the code is verified just once.
EDIT – The programmer writes the program (preferably on a notepad)
and saves it as a ‘.java’ file, which is then further used for
compilation, by the compiler.
COMPILE – The compiler here takes the ‘.java’ file, compiles it
and looks for any possible errors in the scope of the program. If
it finds any error, it reports them to the programmer. If no error
is there, then the program is converted into the bytecode and
saved as a ‘.class’ file.
LOAD – Now the major purpose of the component called ‘Class Loader’
is to load the byte code in the JVM. It doesn’t execute the code yet,
but just loads it into the JVM’s memory.
VERIFY – After loading the code, the JVM’s subpart called ‘Byte
Code verifier’ checks the bytecode and verifies it for its
authenticity. It also checks if the bytecode has any such code
which might lead to some malicious outcome. This component of the
JVM ensures security.
EXECUTE – The next component is the Execution Engine. The execution
engine interprets the code line by line using the Just In Time (JIT)
compiler. The JIT compiler does the execution pretty fast but
consumes extra cache memory.
The spec lists 4 phases in bytecode verification. These phases are functionally distinct, not to be mistaken for repetition of the same work. Just as a multi-pass compiler uses each pass to set up for the next pass, the phases are not repetition but are orchestrated for a single overall purpose; each phase accomplishes certain tasks.
Unless the bytecode is changed, there is no reason to verify it twice.
The verification is described here.
http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.10
Verification of code happens twice: once during compilation (compilation fails if the code has flaws or threats) and again after the class is loaded into memory during execution (the actual bytecode verification happens here). Yes, this happens along with the process of loading classes (by class loaders), but the class loaders themselves might not act as verifiers. It's the JVM (or rather, the verifier present in the JVM) that does the verification.
I've recently been looking at The Java Virtual Machine Specification (JVMS) to try to better understand what makes my programs work, but I've found a section that I'm not quite getting...
Section 4.7.4 describes the StackMapTable attribute, and in that section the document goes into detail about stack map frames. The issue is that it's a little wordy and I learn best by example, not by reading.
I understand that the first stack map frame is derived from the method descriptor, but I don't understand how (which is supposedly explained here.) Also, I don't entirely understand what the stack map frames do. I would assume they're similar to blocks in Java, but it appears as though you can't have stack map frames inside each other.
Anyway, I have two specific questions:
What do the stack map frames do?
How is the first stack map frame created?
and one general question:
Can someone provide an explanation less wordy and easier to understand than the one given in the JVMS?
Java requires all classes that are loaded to be verified, in order to maintain the security of the sandbox and ensure that the code is safe to optimize. Note that this is done on the bytecode level, so the verification does not verify invariants of the Java language, it merely verifies that the bytecode makes sense according to the rules for bytecode.
Among other things, bytecode verification makes sure that instructions are well formed, that all the jumps are to valid instructions within the method, and that all instructions operate on values of the correct type. The last one is where the stack map comes in.
The thing is that bytecode by itself contains no explicit type information. Types are determined implicitly through dataflow analysis. For example, an iconst instruction creates an integer value. If you store it in slot 1, that slot now has an int. If control flow merges from code which stores a float there instead, the slot is now considered to have invalid type, meaning that you can't do anything more with that value until overwriting it.
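Here is what such a merge looks like as a sketch (a hypothetical void method taking one int parameter, offsets in javap style):

 0: iload_0    // push the int parameter
 1: ifeq 9     // the two paths diverge here
 4: iconst_5
 5: istore_1   // slot 1 now holds an int
 6: goto 11
 9: fconst_0
10: fstore_1   // slot 1 now holds a float
11: return     // merge point: slot 1 is int on one path and float on the other,
               // so from here on the verifier treats it as unusable

This method still verifies, because nothing reads slot 1 after the merge; appending an iload_1 or fload_1 would get it rejected.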
Historically, the bytecode verifier inferred all the types using these dataflow rules. Unfortunately, it is impossible to infer all the types in a single linear pass through the bytecode because a backwards jump might invalidate already inferred types. The classic verifier solved this by iterating through the code until everything stopped changing, potentially requiring multiple passes.
However, verification makes class loading slow in Java. Oracle decided to solve this issue by adding a new, faster verifier that can verify bytecode in a single pass. To do this, they required all new classes starting in Java 7 (with Java 6 in a transitional state) to carry metadata about their types, so that the bytecode can be verified in a single pass. Since the bytecode format itself can't be changed, this type information is stored separately in an attribute called StackMapTable.
Simply storing the type for every single value at every single point in the code would obviously take up a lot of space and be very wasteful. In order to make the metadata smaller and more efficient, they decided to have it only list the types at positions which are targets of jumps. If you think about it, this is the only time you need the extra information to do a single pass verification. In between jump targets, all control flow is linear, so you can infer the types at in between positions using the old inference rules.
Each position where types are explicitly listed is known as a stack map frame. The StackMapTable attribute contains a list of frames in order, though they are usually expressed as a difference from the previous frame in order to reduce data size. If there are no frames in the method, which occurs when control flow never joins (i.e. the CFG is a tree), then the StackMapTable attribute can be omitted entirely.
So this is the basic idea of how StackMapTable works and why it was added. The last question is how the implicit initial frame is created. The answer, of course, is that at the beginning of the method, the operand stack is empty and the local variable slots have the types given by the types of the method parameters, which are determined from the method descriptor.
If you're used to Java, there are a few minor differences to how method parameter types work at the bytecode level. First off, virtual methods have an implicit this as first parameter. Second, boolean, byte, char, and short do not exist at the bytecode level. Instead, they are all implemented as ints behind the scenes.
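For a concrete illustration (a sketch; exact offsets and javap formatting can vary by compiler version), a method like static int max(int a, int b) { if (a > b) return a; return b; } has a single jump target, so javap -v shows a StackMapTable with one frame:

0: iload_0
1: iload_1
2: if_icmple 7
5: iload_0
6: ireturn
7: iload_1      // jump target: this is where a frame is recorded
8: ireturn
StackMapTable: number_of_entries = 1
  frame_type = 7 /* same */  // locals still [int, int], stack still empty

The initial frame is never stored explicitly: the descriptor (II)I gives locals = [int, int] and an empty operand stack at offset 0, and the frame at offset 7 is just "same as before".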
I am a little bit curious about what happens if I manually change something in the bytecode before execution. For instance, suppose I assign an int variable to a byte variable without casting, or remove a semicolon from somewhere in the program, or do anything else that would lead to a compile-time error. As I understand it, all compile-time errors are checked by the compiler before the .class file is made. So what happens when I change the bytecode manually after successfully compiling the program? Is there any mechanism to handle this? If not, how does the program behave after execution?
EDIT :-
As Hot Licks, Darksonn and manouti have already given correct, satisfying answers, I'll just conclude for those readers seeking an answer to this type of question:
Every Java virtual machine has a class-file verifier, which ensures that loaded class files have a proper internal structure. If the class-file verifier discovers a problem with a class file, it throws an exception. Because a class file is just a sequence of binary data, a virtual machine can't know whether a particular class file was generated by a well-meaning Java compiler or by shady crackers bent on compromising the integrity of the virtual machine. As a consequence, all JVM implementations have a class-file verifier that can be invoked on untrusted classes, to make sure the classes are safe to use.
Refer to this for more details.
You certainly can use a hex editor (eg, the free "HDD Hex Editor Neo") or some other tool to modify the bytes of a Java .class file. But obviously, you must do so in a way that maintains the file's "integrity" (tables all in correct format, etc). Furthermore (and much trickier), any modification you make must pass muster by the JVM's "verifier", which essentially rechecks everything that javac verified while compiling the program.
The verification process occurs during class loading and is quite complex. Basically, a data flow analysis is done on each procedure to assure that only the correct data types can "reach" a point where the data type is assumed. Eg, you can't change a load operation to load a reference to a HashMap onto the "stack" when the eventual user of the loaded reference will be assuming it's a String. (But enumerating all the checks the verifier does would be a major task in itself. I can't remember half of them, even though I wrote the verifier for the IBM iSeries JVM.)
(If you're asking if one can "jailbreak" a Java .class file to introduce code that does unauthorized things, the answer is no.)
You will most likely get a java.lang.VerifyError:
Thrown when the "verifier" detects that a class file, though well formed, contains some sort of internal inconsistency or security problem.
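As a sketch of how you can trigger this deliberately (assuming the ASM bytecode library is on the classpath; the class and method names are made up), generating a method whose single instruction underflows the operand stack produces exactly this error:

import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class MakeBadClass {
    public static void main(String[] args) throws Exception {
        ClassWriter cw = new ClassWriter(0);
        cw.visit(Opcodes.V1_8, Opcodes.ACC_PUBLIC, "Bad", null, "java/lang/Object", null);

        MethodVisitor mv = cw.visitMethod(
                Opcodes.ACC_PUBLIC | Opcodes.ACC_STATIC, "bad", "()I", null, null);
        mv.visitCode();
        mv.visitInsn(Opcodes.IRETURN);  // claims to return an int, but nothing was pushed
        mv.visitMaxs(1, 0);
        mv.visitEnd();
        cw.visitEnd();

        byte[] bytes = cw.toByteArray();
        Class<?> bad = new ClassLoader() {
            Class<?> define() { return defineClass("Bad", bytes, 0, bytes.length); }
        }.define();

        // Linking (and therefore verification) is triggered on first use:
        bad.getDeclaredMethod("bad").invoke(null);  // throws java.lang.VerifyError
    }
}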
You can certainly do this, and there are even tools to make it easier, like http://set.ee/jbe/. The Java runtime will run your modified bytecode just as it would run the bytecode emitted by the compiler. What you're describing is a Java-specific case of a binary patch.
The semicolon example wouldn't be an issue, since semicolons are only for the convenience of the compiler and don't appear in the bytecode.
Either the bytecode executes normally and performs the instructions given, or the JVM rejects it.
I played around with programming directly in Java bytecode some time ago using jasmin, and I noticed some things.
If the bytecode you edit it into makes sense, it will of course run as expected. However, there are some bytecode patterns that are rejected with a VerifyError.
For the specific example of out-of-bounds access, you can compile such code just fine. It will get you an ArrayIndexOutOfBoundsException at runtime:
int[] arr = new int[20];
for (int i = 0; i < 100; i++) {
    arr[i] = i;   // compiles and verifies fine, but throws
                  // ArrayIndexOutOfBoundsException when i reaches 20
}
However, you can construct bytecode that is more fundamentally flawed than that. To give an example, I'll explain some things first.
Java bytecode works with a stack, and instructions operate on the top elements of that stack.
The stack naturally has different sizes at different places in the program, but sometimes you might use a goto in the bytecode that causes the stack to look different depending on how you reached a given point.
For example, the stack might contain object, int; you then store the object in an object array and the int in an int array. But from somewhere else in that bytecode you use a goto, and now your stack contains int, object, which would result in the int being passed to the object array and vice versa.
This is just one example of the things that could happen to make your bytecode fundamentally flawed. The JVM detects these kinds of flaws when the class is loaded at runtime, and then throws a VerifyError if something doesn't work, as in the sketch below.
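Here is a sketch of such a flaw in the same listing style as earlier (a hypothetical void method taking one int parameter):

 0: iload_0      // push the int parameter
 1: ifeq 8       // the two paths diverge
 4: aconst_null  // this path leaves a reference on the stack
 5: goto 9
 8: iconst_1     // this path leaves an int on the stack
 9: astore_0     // merge point: reference on one path, int on the other --
                 // the verifier rejects the method with a VerifyError
10: return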