Default number system and character set in Java - java

This is a fundamental question about how Java works, so I don't have any code to support it.
I am new to Java development and want to know how the different number systems and character sets like UTF-8 and Unicode come together in Java.
Let's say a user creates a new String and int with the same value.
int i=100;
String S="100";
The hardware of a computer understands zeros and ones, so the values have to be converted to binary (correct me if I'm wrong)? This conversion should be done by the JVM (correct me if I'm wrong)? And to represent characters of different languages as characters that can be typed on an (English) keyboard, UTF-8 and similar conversions are used (correction needed)?
Now, how does this whole flow fit into the bigger picture of running a Java web application?
How does a String/int get converted to binary for the machine's hardware to understand?
How does it get converted to UTF-8 for a browser to understand?
And what are the default number format and character set in Java? If I'm reading the contents of a file, will they be read as binary or as UTF-8?

All computers run in binary. The conversion is done by the JVM and by the computer itself, so you shouldn't have to worry about converting your code into the corresponding 1's and 0's. The browser has its own built-in conversion code that turns those universal 1's and 0's (used by all programs and software) into whatever it decides to display. All languages are just a translation guide that lets the user "speak" with the computer, and vice versa. Hope this helps, though I don't think I really answered anything.

How Java represents any data type in memory is the choice of the actual JVM. In practice, the JVM will choose the format native to the processor (e.g. choose between little/big endian for int), simply because it offers the best performance on that platform.
Basically, the JLS makes certain guarantees (like that a byte has 8 bits and the values range from -128 to 127), and the VM just maps that to the platform as it deems suitable (the JLS was specified to match common computing technology closely, so there is usually no magic needed to guess how primitive types map to the platform).
You should never care how the VM represents data in memory; Java does not offer any legal way to access the data in a manner where you would need to know (bypassing most of the VM's logic by using sun.misc.Unsafe is not considered legal).
If you care for educational purposes, learn what binary representations the underlying platform (e.g. x86) uses and take a look at the VM. It has little to do with Java really, it's all VM and platform specific.
For java.lang.String, it's the implementation of the class that defines how the String is stored internally (it has gone through quite a few changes over major Java versions), but what the String exposes is quite narrowly defined (see the JDK javadoc for String.length() and String.charAt()).
As for how user input is translated to Java standard types, that's actually platform specific. The JVM selects the default encoding (e.g. String.getBytes() can return quite different results for the same string, depending on the platform, which is why it's recommended to explicitly specify the desired encoding). The same goes for many other things (time zone, number format etc.).
Charsets and formats are building blocks the program wires up to translate data from the outside world (file, HTTP or user input) into Java's representation of data (or vice versa). For example, a web application will use the encoding from an HTTP header to determine which Charset to use when interpreting the contents (the encoding of the HTTP headers themselves is defined to be US-ASCII by the spec).
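To make the int/String distinction from the question concrete, here is a minimal sketch. The class and method names (EncodingDemo, binaryOf, utf8Of) are made up for illustration. It shows the binary digits of the int 100 next to the bytes you get when the String "100" is encoded with an explicitly chosen charset, which is the practice recommended above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
final class EncodingDemo {

    // Binary digits of an int value, as a human-readable string.
    static String binaryOf(int value) {
        return Integer.toBinaryString(value);
    }

    // Bytes of a String encoded as UTF-8. The charset is passed explicitly,
    // so the result does not depend on the platform default.
    static byte[] utf8Of(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(binaryOf(100));                  // 1100100
        System.out.println(Arrays.toString(utf8Of("100"))); // [49, 48, 48]

        // The same character can need a different number of bytes per charset.
        String eAcute = "\u00e9";
        System.out.println(utf8Of(eAcute).length);                               // 2
        System.out.println(eAcute.getBytes(StandardCharsets.ISO_8859_1).length); // 1
    }
}
```

The int is a number; the String is three characters that only become bytes once a charset is applied.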

Related

Google ProtoBuf serialization / deserialization

I am reading about Google Protocol Buffers. I want to know whether I can serialize a C++ object, send it over the wire to a Java server, deserialize it there in Java, and introspect the fields.
More generally, I want to send objects from any language to a Java server and deserialize them there.
Assume following is my .proto file
message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}
I ran protoc on this and created a C++ object.
Basically, I now want to send the serialized stream to the Java server.
On the Java side, can I deserialize the stream so that I can find out that there are 3 fields in the stream, along with their respective names, types, and values?
On the Java side, can I deserialize the stream so that I can find out that there are 3 fields in the stream, along with their respective names, types, and values?
You will need to know the schema in advance. Firstly, protobuf does not transmit names; all it uses as identifiers is the numeric key (1, 2 and 3 in your example) of each field. Secondly, it does not explicitly specify the type; there are only a very few wire-types in protobuf (varint, 32-bit, 64-bit, length-prefix, group); actual data types are mapped onto those, but you cannot unambiguously decode data without the schema:
varint is "some form of integer", but could be signed, unsigned or "zigzag" (which allows negative numbers of small magnitude to be cheaply encoded), and could be intended to represent any width of data (64 bit, 32 bit, etc)
32-bit could be an integer, but could be signed or unsigned - or it could be a 32-bit floating-point number
64-bit could be an integer, but could be signed or unsigned - or it could be a 64-bit floating-point number
length-prefix could be a UTF-8 string, a sequence of raw bytes (without any particular meaning), a "packed" set of repeated values of some primitive type (integer, floating point, etc), or could be a structured sub-message in protobuf format
groups - hoorah! this is always unambiguous! this can only mean one thing; but that one thing is largely deprecated by google :(
So fundamentally: you need the schema. The encoded data does not include what you want. It does this to avoid unnecessary space - if the protocol assumes that the encoder and decoder both know what the message is meant to look like, then a lot less information needs to be sent.
Note, however, that the information that is included is enough to safely round-trip a message even if there are fields that are not expected; it is not necessary to know the name or type if you only need to re-encode it to pass it along / back.
What you can do is use the parser API to scan over the data to reveal that there are three fields, field 1 is a varint, field 2 is length-prefixed, field 3 is length-prefixed. You could make educated guesses about the data beyond that (for example, you could see whether a UTF-8 decode produces something that looks roughly text-like, and verify that UTF-8 encoding that gives you back the original bytes; if it does, it is possible it is a string)
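The schema-less scan described above can be sketched roughly as follows. This is a hand-rolled reader written only for illustration, not the protobuf library's parser API; the class and method names (WireScan, scan, readVarint) are made up. It assumes well-formed input, rejects groups, and only reports each field's number and wire type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical schema-less scanner; NOT the official protobuf parser API.
final class WireScan {

    // Returns one description per field found, e.g. "field 1: varint".
    static List<String> scan(byte[] data) {
        List<String> fields = new ArrayList<>();
        int[] pos = {0}; // single-element array used as a mutable cursor
        while (pos[0] < data.length) {
            long tag = readVarint(data, pos);
            int fieldNumber = (int) (tag >>> 3); // upper bits: field number
            int wireType = (int) (tag & 7);      // lower 3 bits: wire type
            switch (wireType) {
                case 0: // varint: value must be read to find where it ends
                    readVarint(data, pos);
                    fields.add("field " + fieldNumber + ": varint");
                    break;
                case 1: // fixed 64-bit
                    pos[0] += 8;
                    fields.add("field " + fieldNumber + ": 64-bit");
                    break;
                case 2: { // length-prefixed: skip the payload
                    int length = (int) readVarint(data, pos);
                    pos[0] += length;
                    fields.add("field " + fieldNumber + ": length-prefixed");
                    break;
                }
                case 5: // fixed 32-bit
                    pos[0] += 4;
                    fields.add("field " + fieldNumber + ": 32-bit");
                    break;
                default: // groups (3, 4) and unknown types are not handled here
                    throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return fields;
    }

    // Base-128 varint: 7 payload bits per byte, high bit means "more follows".
    static long readVarint(byte[] data, int[] pos) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = data[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }
}
```

Note that the scan can tell you "field 2 is length-prefixed" but not that it is called name or that it holds a string; that still takes the schema.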
Can I serialize a C++ object, send it over the wire to a Java server, deserialize it there in Java, and introspect the fields?
Yes, it is the very goal of protobuf.
Serialize data in an application developed in any supported language, and deserialize data in an application developed in any supported language. Serialization and deserialization languages can be the same, or be different.
Keep in mind that protocol buffers are not self-describing, so both sides of your application need to have serializers/deserializers generated from the .proto file.
In short: yes you can.
You will need to create .proto files which define the data structures that you want to share. By using the Google Protocol Buffers compiler you can then generate interfaces and (de)serialization code for your structures for both Java and C++ (and almost any other language you can think of).
To transfer your data over the wire you can use, for instance, ZeroMQ, which is an extremely versatile communications framework that also sports a slew of different language APIs, among them Java and C++.
See this question for more details.

filename encoding using java

I want to write a reversible Encoder along with the corresponding Decoder, so that any string may be encoded to a legal file name corresponding to file naming rules of the Unix file system.
How to achieve this?
Example:
"xyz.txt" would be a valid file name, while "xyz/.txt" would not.
tl;dr: Your approach is flawed. Stick with the limitations of the file system. They're pretty hard to gracefully overcome (especially without introducing your own, even weirder limitations).
It's not possible to make one that is strictly decodable. You're trying to map a larger domain to a smaller domain which means that the reverse mapping cannot be known-correctly reversible.
This is easy to demonstrate with a simple example: how do you encode / such that it can be reversed? "Easy," you might say, "I'll just replace it with the token x." But now how do you know when an x is an actual x and when your x is a 'special' x that should be converted to /? You can't.
You can of course make a system that is very unlikely to have any accidental clashes. For example, rather than changing / to - (which would be very error prone), you could change it to ---.
Oh, also, for what it's worth, most unix file systems actually consider any characters other than / or a null char a valid character (more). Obviously using that is a pain in the ass though.
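For what it's worth, the x example above is ambiguous because a literal x is left unescaped. A common workaround is to reserve one escape character and escape that character too, as percent-encoding does. Below is a minimal sketch under that assumption; the class name FileNameCodec is made up. It only handles /, the null char and % itself, and it does not deal with the empty string, the special names . and .., or file name length limits, so it is not a complete answer to "any string".

```java
// Hypothetical percent-style escaping; names are illustrative only.
final class FileNameCodec {

    // Escapes '/', NUL and the escape character '%' itself as %XX.
    static String encode(String raw) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '/' || c == '\0' || c == '%') {
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Reverses encode(): every '%' is followed by exactly two hex digits.
    static String decode(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '%') {
                sb.append((char) Integer.parseInt(name.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Because a literal % is always written as %25, a % in the encoded name can only ever start an escape, which is what removes the ambiguity.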

Why to avoid using ByteStream much in Java

According to the Sun documentation, we shouldn't use byte streams much:
actually it represents a kind of low-level I/O that you should avoid.
What exactly is low-level I/O, and what exactly is the problem with using a byte stream?
So the Java docs say:
CopyBytes seems like a normal program, but it actually represents a
kind of low-level I/O that you should avoid. Since xanadu.txt contains
character data, the best approach is to use character streams, as
discussed in the next section. There are also streams for more
complicated data types. Byte streams should only be used for the most
primitive I/O.
The byte streams give you access to the file as it is. Just the bytes. No interpretation of any kind. That means no character set conversion, no handling of ints or floats in binary or ASCII representation, no dealing with byte orders, or any of that. The higher-level streams provide some of these.
Of course a program that copies a file is actually a pretty good example of something that needs a raw byte stream, because it doesn't need or want to do any kind of interpretation of the data; it just wants to copy it verbatim.
So what they really mean is, use byte streams if you think you need them, but be sure you know what you are doing :)
The suggestion is in the context of reading a text file that is discussed in the tutorial. For that purpose it is better to use character streams to handle character set translation properly:
The Java platform stores character values using Unicode conventions.
Character stream I/O automatically translates this internal format to
and from the local character set.
A program that uses character streams in place of byte streams
automatically adapts to the local character set and is ready for
internationalization — all without extra effort by the programmer.
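As a rough illustration of the difference, here is a small sketch that reads the same bytes once through a byte stream and once through a character stream. The class and method names (StreamDemo, copyBytes, readText) are made up, and unlike the tutorial text it passes the charset explicitly rather than relying on the local default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class StreamDemo {

    // Byte stream: copies the data verbatim, with no interpretation.
    static byte[] copyBytes(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Character stream: decodes the bytes into chars using the given charset.
    static String readText(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(copyBytes(new ByteArrayInputStream(raw)).length);                          // 6 bytes
        System.out.println(readText(new ByteArrayInputStream(raw), StandardCharsets.UTF_8).length()); // 5 chars
    }
}
```

The byte version is what you want for copying a file; the character version is what you want as soon as the content is meant to be text.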

Is there any difference between Java byte code and .NET byte code? If so, shall I take the hexadecimal of those values?

I would like to know whether there is any difference between Java byte code and .NET byte code. If there is a difference, shall I take the hexadecimal values of the Java byte code and the .NET byte code? I ask because hexadecimal is independent of languages and is a universal specification.
Problem description
We are developing a mobile application in J2ME and Java. I am using an external fingerprint reader for reading/verifying fingerprints, and we are using a Java API to do so.
I capture the finger template and the raw image bytes. I convert the raw image bytes into hex form and store them in a separate text file.
We use a conversion tool (developed in .NET) that converts the hex form into an image. With the help of that tool we are trying to get the image from that text file, but we cannot get the image correctly.
The .NET programmer says the Java byte and the .NET byte differ: a Java byte ranges from -128 to 127, but a .NET byte ranges from 0 to 255. So there is a problem.
But my assumption here is that hex is independent of Java and .NET; it is common to both. So, instead of storing the raw bytes in the text file, I plan to convert them into hexadecimal format, so that our .NET conversion tool can convert this hexadecimal into the image.
I don't know whether I am on the correct path or not.
Hexadecimal is just a way to represent numbers.
Java is compiled to bytecode and executed by a JVM.
.NET is compiled to bytecode and executed by the CLR.
The two formats are completely incompatible.
I capture the finger template and the raw image bytes. I convert the raw image bytes into hex form and store them in a separate text file.
OK; note, storing as binary would have been easier (and more efficient), but that should work.
We use a conversion tool (developed in .NET) that converts the hex form into an image. With the help of that tool we are trying to get the image from that text file, but we cannot get the image correctly.
Rather than worrying about the image, the first thing to do is check where the problem is; there are two obvious scenarios:
you aren't reading the data back into the same bytes
you have the right bytes, but you can't get it to load as an image
First; figure out which of those it is, simply by storing some known data and attempting to read it back at the other end.
The .NET programmer says the Java byte and the .NET byte differ: a Java byte ranges from -128 to 127, but a .NET byte ranges from 0 to 255. So there is a problem.
That shouldn't be a problem for any well-written hex encoder. I would expect a single Java byte to be written correctly as a single hex value between 00 and FF.
I don't know whether I am on the correct path or not.
Personally, I suspect you are misunderstanding the problem, which makes it likely that the solution is off the mark. If you want to make life easier, store as binary rather than text; but there is no inherent problem exchanging hex around. If I had to pack raw binary data into a text file, personally I'd probably go for base-64 rather than hex (it will be shorter), but either is fine.
As I mentioned above: first figure out whether the problem is in reading the bytes, vs processing the bytes into an image. I'm also making the assumption that the bytes here are an image format that both environments can process, and not (for example) a custom serialization format.
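A minimal sketch of such a hex encoder in Java is shown below; the class name HexCodec is made up for illustration. The important step is masking each byte with 0xFF, so that a negative Java byte such as -1 is written as FF, which is the same value a .NET byte of 255 would produce.

```java
// Hypothetical hex codec; names are illustrative only.
final class HexCodec {

    private static final char[] DIGITS = "0123456789ABCDEF".toCharArray();

    // Writes every byte as exactly two hex digits, 00 to FF.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            int unsigned = b & 0xFF; // maps -128..127 onto 0..255
            sb.append(DIGITS[unsigned >>> 4]).append(DIGITS[unsigned & 0x0F]);
        }
        return sb.toString();
    }

    // Reads the digits back two at a time; assumes an even-length hex string.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }
}
```

Writing some known bytes with toHex and reading them back on the .NET side is one way to run the "store some known data" check suggested above.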
Yes, Java byte code and .NET’s byte code are two different things that are not interchangeable. As to the second part of your question, I have no idea what you are talking about.
Yes, they are different, although there are tools that can migrate from one to the other.
Search Google for "java bytecode IL comparison". This one is from that same search.

Developing a (file) exchange format for java

I want to come up with a binary format for passing data between application instances in a form of POFs (Plain Old Files ;)).
Prerequisites:
should be cross-platform
information to be persisted includes a single POJO & arbitrary byte[]s (files actually, the POJO stores their names in a String[])
only sequential access is required
there should be a way to check data consistency
should be small and fast
should prevent an average user with archiver + notepad from modifying the data
Currently I'm using DeflaterOutputStream + OutputStreamWriter together with InflaterInputStream + InputStreamReader to save/restore objects serialized with XStream, one object per file. Readers/Writers use UTF-8.
Now I need to extend this to support the requirements described above.
My idea of format:
{serialized to XML object}
{delimiter}
{String file name}{delimiter}{byte[] file data}
{delimiter}
{another String file name}{delimiter}{another byte[] file data}
...
{delimiter}
{delimiter}
{MD5 hash for the entire file}
Does this look sane?
What would you use for a delimiter and how would you determine it?
The right way to calculate MD5 in this case?
What would you suggest to read on the subject?
TIA.
It looks INsane.
why invent a new file format?
why try to prevent only stupid users from changing file?
why use a binary format (hard to compress)?
why use a format that cannot be parsed while being received? (the receiver has to receive the entire file before being able to act on it)
XML is already a serialization format that is compressible. So you are serializing a serialized format.
Would serialization of the model (if you are into MVC) not be another way? I'd prefer to use things in the language (or standard libraries) rather than roll my own if possible. The only issue I can see with that is that the file size may be larger than you want.
1) Does this look sane?
It looks fairly sane. However, if you are going to invent your own format rather than just using Java serialization then you should have a good reason. Do you have any good reasons (they do exist in some cases)? One of the standard reasons for using XStream is to make the result human readable, which a binary format immediately loses. Do you have a good reason for a binary format rather than a human readable one? See this question for why human readable is good (and bad).
Wouldn't it be easier just to put everything in a signed jar? There are already standard Java libraries and tools to do this, and you get compression and verification provided.
2) What would you use for a delimiter and how would you determine it?
Rather than a delimiter I'd explicitly store the length of each block before the block. It's just as easy, and prevents you having to escape the delimiter if it comes up on its own.
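A minimal sketch of that length-prefix idea, using DataOutputStream and DataInputStream from the standard library. The class and method names (LengthPrefixed, pack, unpack) and the choice of a 4-byte int length are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical container format: [int length][bytes] repeated.
final class LengthPrefixed {

    // Writes each block as a 4-byte big-endian length followed by its bytes.
    static byte[] pack(List<byte[]> blocks) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (byte[] block : blocks) {
                out.writeInt(block.length);
                out.write(block);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads blocks back sequentially until the input is exhausted.
    static List<byte[]> unpack(byte[] data) {
        List<byte[]> blocks = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                byte[] block = new byte[in.readInt()];
                in.readFully(block);
                blocks.add(block);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return blocks;
    }
}
```

Since the reader always knows how many bytes belong to the current block, the block contents can contain any byte sequence and nothing needs escaping.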
3) The right way to calculate MD5 in this case?
There is example code here which looks sensible.
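The linked example is not reproduced here, so the following is a separate minimal sketch using java.security.MessageDigest rather than the code that answer points to. The class and method names (Md5Demo, md5Hex) are made up. For the format in the question, the input would presumably be every byte that precedes the hash field.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; names are illustrative only.
final class Md5Demo {

    // Returns the MD5 digest of the data as 32 lowercase hex digits.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b & 0xFF));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("abc".getBytes(StandardCharsets.UTF_8)));
        // 900150983cd24fb0d6963f7d28e17f72
    }
}
```

For large files, feeding the data to MessageDigest.update() in chunks (or wrapping the stream in a DigestInputStream) avoids holding everything in memory.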
4) What would you suggest to read on the subject?
On the subject of serialization? I'd read about the Java serialization, JSON, and XStream serialization so I understood the pros and cons of each, especially the benefits of human readable files. I'd also look at a classic file format, for example from Microsoft, to understand possible design decisions from back in the days that every byte mattered, and how these have been extended. For example: The WAV file format.
Let's see, this should be pretty straightforward.
Prerequisites:
0. should be cross-platform
1. information to be persisted includes a single POJO & arbitrary byte[]s (files actually, the POJO stores their names in a String[])
2. only sequential access is required
3. there should be a way to check data consistency
4. should be small and fast
5. should prevent an average user with archiver + notepad from modifying the data
Well, guess what, you pretty much have it already; it's built into the platform: Object Serialization.
If you need to reduce the amount of data sent over the wire and provide a custom serialization (for instance, you can send only 1, 2, 3 for a given object without the attribute names or anything similar, and read them back in the same sequence), you can use this somewhat "hidden feature".
If you really need it as plain text you can also encode it; it takes almost the same amount of bytes.
For instance this bean:
import java.io.*;

public class SimpleBean implements Serializable {
    private String website = "http://stackoverflow.com";

    public String toString() {
        return website;
    }
}
Could be represented like this:
rO0ABXNyAApTaW1wbGVCZWFuPB4W2ZRCqRICAAFMAAd3ZWJzaXRldAASTGphdmEvbGFuZy9TdHJpbmc7eHB0ABhodHRwOi8vc3RhY2tvdmVyZmxvdy5jb20=
See this answer
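A minimal sketch of how a text representation like the one above can be produced with ObjectOutputStream and java.util.Base64. The names (SerializationDemo, Bean, toBase64, fromBase64) are made up, and because the class name differs from SimpleBean the exact output will differ too; only the leading rO0AB, which comes from the serialization stream header, is the same.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

// Hypothetical demo class; the names are illustrative only.
final class SerializationDemo {

    static class Bean implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String website;

        Bean(String website) {
            this.website = website;
        }

        @Override
        public String toString() {
            return website;
        }
    }

    // Java-serializes the object and encodes the bytes as Base64 text.
    static String toBase64(Serializable obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    // Decodes the Base64 text and deserializes the object it contains.
    static Object fromBase64(String text) {
        byte[] bytes = Base64.getDecoder().decode(text);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Only deserialize data you trust; Java deserialization of untrusted input is a well-known source of security problems.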
Additionally, if you need a sound protocol you can also check out Protobuf, Google's internal exchange format.
You could use a zip (rar / 7z / tar.gz / ...) library. Many exist, most are well tested and it'll likely save you some time.
Possibly not as much fun though.
I agree in that it doesn't really sound like you need a new format, or a binary one.
If you truly want a binary format, why not consider one of these first:
Binary XML (fast infoset, Bnux)
Hessian
Google Protocol Buffers
But besides that, many textual formats should work just fine (or perhaps better) too; easier to debug, extensive tool support, compresses to about same size as binary (binary compresses poorly, and information theory suggests that for same effective information, same compression rate is achieved -- and this has been true in my testing).
So perhaps also consider:
JSON works well; binary support via base64 (with, say, http://jackson.codehaus.org/)
XML not too bad either; efficient streaming parsers, some with base64 support (http://woodstox.codehaus.org/, "typed access API" under 'org.codehaus.stax2.typed.TypedXMLStreamReader').
So it kind of sounds like you just want to build something of your own. Nothing wrong with that, as a hobby, but if so you need to consider it as such.
It likely is not a requirement for the system you are building.
Perhaps you could explain how this is better than using an existing file format such as JAR.
Most standard file formats of this type just use a CRC, as it's faster to calculate. MD5 is more appropriate if you want to prevent deliberate modification.
Bencode could be the way to go.
Here's an excellent implementation by Daniel Spiewak.
Unfortunately, bencode spec doesn't support utf8 which is a showstopper for me.
Might come to this later but currently xml seems like a better choice (with blobs serialized as a Map).
