filename encoding using java - java

I want to write a reversible Encoder along with the corresponding Decoder, so that any string may be encoded to a legal file name corresponding to file naming rules of the Unix file system.
How to achieve this?
Example:
"xyz.txt" would be a valid file name, while "xyz/.txt" would not.

tl;dr: Your approach is flawed. Stick with the limitations of the file system. They're pretty hard to gracefully overcome (especially without introducing your own, even weirder limitations).
It's not possible to make one that is strictly decodable. You're trying to map a larger domain to a smaller domain which means that the reverse mapping cannot be known-correctly reversible.
This is easy to demonstrate with a simple example: how do you encode / such that it can be reversed? "Easy," you might say, "I'll just replace with the token x." But now how do you know when an x is an actual x and when your x is a 'special' x that should be converted to /? You can't.
You can of course make a system that is very unlikely to have any accidental clashes. For example, rather than changing / to - (which would be very error prone), you could change it to ---.
Oh, also, for what it's worth, most unix file systems actually consider any characters other than / or a null char a valid character (more). Obviously using that is a pain in the ass though.

Related

default number system and charecter set in java

Thi is a fundamental questuion about how java works and so i dont have any code to support it.
I am new to java development and want to know how the different number systems, charecter sets like UTF 8 and unicode come together in Java.
Lets say a user creates a new string and int with the same value.
int i=100;
String S="100";
The hardware of a computer understands zeros and ones. so it has to be converted to binary?(correct me if im wrong). this conversion should be done by the JVM(correct me if im wrong)? and to represent charecters of different languages into charecters that can be typed into the keyboard (english) UTF-8 and such conversions are used(correction needed)?
now how does this whole flow fit into the bigger picture of running a java web application?
how does a string/int get converted to a binary for the machine's hardware to understand?
how does it get converted to UTF-8 for a browser to understand?
and what are the default number format and charecterset in java? if im reading contents of a file? will they be read into binary or utf-8?
All computers run in binary. The conversion is done by the JVM and the computer that you have. You shouldn't worry about converting the code into the coordinating 1's and 0's. The browser has its own conversion hard code to change the universal 1's and 0's(used by all programs and computer software) into however it decides to display the given information. All languages are just a translation guide for the user to "speak" with the computer. And vice versa. Hope this helps though I don't think I really answered anything.
How java represents any data type in memory is the choice of the actual JVM. In practice, the JVM will chose the format native to the processor (e.g. chose between little/big endian for int), simply because it offers the best performance on that platform.
Basically, the JLS makes certain guarantees (like that a byte has 8 bits and the values range from -128 to 127) - the VM just maps that to the platform as it deems suitable (the JLS was specified to match common computing technology closely, so there is usually no magic needed to guess how primitive types map to the platform).
You should never care how the VM represents data in memory, java does not offer any legal way to access the data in a manner where you would need to know (bypassing most of the VM's logic by using sun.misc.Unsafe is not considered legal).
If you care for educational purposes, learn what binary representations the underlying platform (e.g. x86) uses and take a look at the VM. It has little to do with java really, its all VM and platform specific.
For java.lang.String, its the implementation of the class that defines how the String is stored internally - it went through quite some changes over major java versions - but what that String exposes is quite narrowly defined (see JDK javadoc for String.length(), String.charAt()).
As for how user input is translated to java standard types, thats actually platform specific. The JVM selects the default encoding (e.g. String.toBytes() can return quite different results for the same string, depending on the platform - thats why its recommended to explictly specify the desired encoding). Same goes for many other things (time zone, number format etc.).
CharSets and Formats are building blocks the program wires up to translate data from the outside world (file, http or user input) into javas representation of data (or vice versa). For example, a Web application will use the encoding from a HTTP header to determine what CharSet to use when interpreting the contents (the HTTP headers encoding is defined to be US-ASCII by the spec).

Is specifying String encoding when parsing byte[] really necessary?

Supposedly, it is "best practice" to specify the encoding when creating a String from a byte[]:
byte[] b;
String a = new String(b, "UTF-8"); // 100% safe
String b = new String(b); // safe enough
If I know my installation has default encoding of utf8, is it really necessary to specify the encoding to still be "best practice"?
Different use cases have to be distinguished here: If you get the bytes from an external source via some protocol with a specified encoding then always use the first form (with explicit encoding).
If the source of the bytes is the local machine, for example a local text file, the second form (without explicit encoding) is better.
Always keep in mind, that your program may be used on a different machine with a different platform encoding. It should work there without any changes.
If I know my installation has default encoding of utf8, is it really necessary to specify the encoding to still be "best practice"?
But do you know for sure that your installation will always have a default encoding of UTF-8? (Or at least, for as long as your code is used ...)
And do you know for sure that your code is never going to be used in a different installation that has a different default encoding?
If the answer to either of those is "No" (and unless you are prescient, it probably has to be "No") then I think that you should follow best practice ... and specify the encoding if that is what your application semantics requires:
If the requirement is to always encode (or decode) in UTF-8, then use "UTF-8".
If the requirement is to always encode (or decode) in using the platform default, then do that.
If the requirement is to support multiple encodings (or the requirement might change) then make the encoding name a configuration (or command line) parameter, resolve to a Charset object and use that.
The point of this "best practice" recommendation is to avoid a foreseeable problem that will arise if your platform's characteristics change. You don't think that is likely, but you probably can't be completely sure about it. But at the end of the day, it is your decision.
(The fact that you are actually thinking about whether "best practice" is appropriate to your situation is a GOOD THING ... in my opinion.)

Shortening long urls with a hash?

I've got a file cache, the files being downloaded from different urls. I'd like to save each file by the name of their url. These names can be quite long though, and I'm on a device using a FAT32 file system - so the long names are eating up resources well before I run out of actual disk space.
I'm looking for a way to shorten the filenames, have gotten suggestions to hash the strings. But I'm not sure if the hashes are guaranteed to be unique for two different strings. It would be bad if I accidentally fetch the wrong image if two hashed urls come up with the same hash value.
You could generate an UUID for each URL and use it as the file name.
UUIDs are unique (or "practically unique") and are 36 characters long, so I guess the file name wouldn't be a problem.
As of version 5, the JDK ships with a class to generate UUIDs (java.util.UUID). You could use randomly generate UUIDs if there's a way to associate them with the URLs, or you could use name based UUIDs. Name based UUIDs are always the same, so the following is always true:
String url = ...
UUID urlUuid = UUID.nameUUIDFromBytes(url.getBytes);
assertTrue(urlUuid.equals(UUID.nameUUIDFromBytes(url.getBytes)));
There's no (shortening) hash which can guarantee different hashes for each input. It's simply not possible.
The way I usually do it is by saving the original name at the beginning (e.g., first line) of the cache file. So to find a file in the cache you do it like this:
Hash the URL
Find the file corresponding to that hash
Check the first line. If it's the same as the full URL:
The rest of the file is from line two and forward
You can also consider saving the URL->file mapping in a database.
But I'm not sure if the hashes are guaranteed to be unique for two different strings.
They very much aren't (and cannot be, due to the pigeonhole principle). But if the hash is sufficiently long (at least 64 bit) and well-distributed (ideally a cryptographic hash), then the likelihood of a collision becomes so small that it's not worth worrying about.
As a rough guideline, collisions will become likely once the number of files approaches the square root of the number of possible different hashes (birthday paradox). So for a 64 bit hash (10 character filenames), you have about a 50% chance of one single collision if you have 4 billion files.
You'll have to decide whether that is an acceptable risk. You can reduce the chance of collision by making the hash longer, but of course at some point that will mean the opposite of what you want.
Currently, the SHA-1 algorithm is recommended. There are no known ways to intentionally provoke collisions for this algorithm, so you should be safe. Provoking collisions with two pieces of data that have common structure (such as the http:// prefix) is even harder. If you save this stuff after you get a HTTP 200 response, then the URL obviously fetched something, so getting two distinct, valid URLs with the same SHA-1 hash really should not be a concern.
If it's of any re-assurance Git uses it to identify all objects, commits and folders in the source code repository. I've yet to hear of someone with a collision in the object store.
what you can do is save the files by an index and use a index file to find the location of the actual file
in the directory you have:
index.txt
file1
file2
...
etc.
and in index.txt you use some datastructure to find the filenames efficiently (or replace with a DB)
Hashes are not guaranteed to be unique, but the chance of a collision is vanishingly small.
If your hash is, say, 128 bits then the chance of a collision for any pair of entries is 1 in 2^128. By the birthday paradox, if you had 10^18 entries in your table then the chance of a collision is only 1%, so you don't really need to worry about it. If you are extra paranoid then increase the size of the hash by using SHA256 or SHA512.
Obviously you need to make sure that the hashed representation actually takes up less space than the original filename. Base-64 encoded strings represent 6 bits per character so you can do the math to find out if it's even worth doing the hash in the first place.
If your file system barfs because the names are too long then you can create prefix subdirectories for the actual storage. For example, if a file maps the the hash ABCDE then you can store it as /path/to/A/B/CDE, or maybe /path/to/ABC/DE depending on what works best for your file system.
Git is a good example of this technique in practice.
Look at my comment.
One possible solution (there are a lot) is creating a local file (SQLite? XML? TXT?) in which you store a pair (file_id - file_name) so you can save your downloaded files with their unique ID as filename.
Just an idea, not the best...

Why do we need a magic number in the beginning of the .class file?

I read a few posts here about the magic number 0xCAFEBABE in the beginning of each java .class file and wanted to know why it is needed - what is the purpose of this marking?
Is it still needed anymore? or is it just for backwards compatibility now?
Couldn't find a post that answers this - nor did I see the answer in the java spec
The magic number is basically an identifier for a file format. A JPEG for example always starts with FFD8. It is not necessary for Java itself, it simply helps to identify the file-type. You can read more about magic numbers here.
See: http://www.artima.com/insidejvm/whyCAFEBABE.html
EDIT: and http://radio-weblogs.com/0100490/2003/01/28.html
Some answers:
Well, they presumably had to pick
something as their magic number to
identify class files, and there's a
limit to how many Java or coffee
related words you can come up with
using just the letters A-F :-)
-
As to why the magic number is
3405691582 (0xCAFEBABE), well my guess
is that (a) 32-bit magic numbers are
easier to handle and more likely to be
unique, and (b) the Java team wanted
something with the Java-coffee
metaphor, and since there's no 'J' or
'V' in hexadecimal, settled for
something with CAFE in it. I guess
they figured "CAFE BABE" was sexier
than something like "A FAB CAFE" or
"CAFE FACE", and definitely didn't
like the implications of "CAFE A FAD"
(or worse, "A BAD CAFE").
-
Don't know why I missed this before,
but they could have used the number
12648430, if you choose to read the
hex zeros as the letter 'O'. That
gives you 0xC0FFEE, or 0x00C0FFEE to
specify all 32 bits. OO COFFEE? Object
Oriented, of course... :-)
-
I originally saw 0xCAFEBABE as a magic
number used by NeXTSTEP. NX used "fat
binaries", which were basically
binaries for different platforms stuck
together in one executable file. If
you were running on NX Intel, it would
run the Intel binary; if on HP, it
would run the HP binary. 0xCAFEBABE
was the magic number to distinguish
either the Intel or the Motorola
binaries ( can't remember which ).
Magic numbers are a common technique to make things, such as files, identifiable.
The idea is that you just have to read the first few bytes of the file to know if this is most likely a Java class file or not. If the first bytes are not equal to the magic number, then you know for sure that it is not a valid Java class file.
It's fairly common practice with binary files to have some sort of fixed identifier at the beginning (e.g. zip files begin with the characters PK). This reduces the possibility of accidentally trying to interpret the wrong sort of file as a class file.

Developing a (file) exchange format for java

I want to come up with a binary format for passing data between application instances in a form of POFs (Plain Old Files ;)).
Prerequisites:
should be cross-platform
information to be persisted includes a single POJO & arbitrary byte[]s (files actually, the POJO stores it's names in a String[])
only sequential access is required
should be a way to check data consistency
should be small and fast
should prevent an average user with archiver + notepad from modifying the data
Currently I'm using DeflaterOutputStream + OutputStreamWriter together with InflaterInputStream + InputStreamReader to save/restore objects serialized with XStream, one object per file. Readers/Writers use UTF8.
Now, need to extend this to support the previously described.
My idea of format:
{serialized to XML object}
{delimiter}
{String file name}{delimiter}{byte[] file data}
{delimiter}
{another String file name}{delimiter}{another byte[] file data}
...
{delimiter}
{delimiter}
{MD5 hash for the entire file}
Does this look sane?
What would you use for a delimiter and how would you determine it?
The right way to calculate MD5 in this case?
What would you suggest to read on the subject?
TIA.
It looks INsane.
why invent a new file format?
why try to prevent only stupid users from changing file?
why use a binary format ( hard to compress ) ?
why use a format that cannot be parsed while being received? (receiver has to receive entire file before being able to act on the file. )
XML is already a serialization format that is compressable. So you are serializing a serialized format.
Would serialization of the model (if you are into MVC) not be another way? I'd prefer to use things in the language (or standard libraries) rather then roll my own if possible. The only issue I can see with that is that the file size may be larger than you want.
1) Does this look sane?
It looks fairly sane. However, if you are going to invent your own format rather than just using Java serialization then you should have a good reason. Do you have any good reasons (they do exist in some cases)? One of the standard reasons for using XStream is to make the result human readable, which a binary format immediately loses. Do you have a good reason for a binary format rather than a human readable one? See this question for why human readable is good (and bad).
Wouldn't it be easier just to put everything in a signed jar. There are already standard Java libraries and tools to do this, and you get compression and verification provided.
2) What would you use for a delimiter and how determine it?
Rather than a delimiter I'd explicitly store the length of each block before the block. It's just as easy, and prevents you having to escape the delimiter if it comes up on its own.
3) The right way to calculate MD5 in this case?
There is example code here which looks sensible.
4) What would you suggest to read on the subject?
On the subject of serialization? I'd read about the Java serialization, JSON, and XStream serialization so I understood the pros and cons of each, especially the benefits of human readable files. I'd also look at a classic file format, for example from Microsoft, to understand possible design decisions from back in the days that every byte mattered, and how these have been extended. For example: The WAV file format.
Let's see this should be pretty straightforward.
Prerequisites:
0. should be cross-platform
1. information to be persisted includes a single POJO & arbitrary byte[]s (files actually, the POJO stores it's names in a String[])
2. only sequential access is required
3. should be a way to check data consistency
4. should be small and fast
5. should prevent an average user with archiver + notepad from modifying the data
Well guess what, you pretty much have it already, it's built-in the platform already:Object Serialization
If you need to reduce the amount of data sent in the wire and provide a custom serialization ( for instance you can sent only 1,2,3 for a given object without using the attribute name or nothing similar, and read them in the same sequence, ) you can use this somehow "Hidden feature"
If you really need it in "text plain" you can also encode it, it takes almost the same amount of bytes.
For instance this bean:
import java.io.*;
public class SimpleBean implements Serializable {
private String website = "http://stackoverflow.com";
public String toString() {
return website;
}
}
Could be represented like this:
rO0ABXNyAApTaW1wbGVCZWFuPB4W2ZRCqRICAAFMAAd3ZWJzaXRldAASTGphdmEvbGFuZy9TdHJpbmc7eHB0ABhodHRwOi8vc3RhY2tvdmVyZmxvdy5jb20=
See this answer
Additionally, if you need a sounded protocol you can also check to Protobuf, Google's internal exchange format.
You could use a zip (rar / 7z / tar.gz / ...) library. Many exists, most are well tested and it'll likely save you some time.
Possibly not as much fun though.
I agree in that it doesn't really sound like you need a new format, or a binary one.
If you truly want a binary format, why not consider one of these first:
Binary XML (fast infoset, Bnux)
Hessian
google packet buffers
But besides that, many textual formats should work just fine (or perhaps better) too; easier to debug, extensive tool support, compresses to about same size as binary (binary compresses poorly, and information theory suggests that for same effective information, same compression rate is achieved -- and this has been true in my testing).
So perhaps also consider:
Json works well; binary support via base64 (with, say, http://jackson.codehaus.org/)
XML not too bad either; efficient streaming parsers, some with base64 support (http://woodstox.codehaus.org/, "typed access API" under 'org.codehaus.stax2.typed.TypedXMLStreamReader').
So it kind of sounds like you just want to build something of your own. Nothing wrong with that, as a hobby, but if so you need to consider it as such.
It likely is not a requirement for the system you are building.
Perhaps you could explain how this is better than using an existing file format such as JAR.
Most standard files formats of this type just use CRC as its faster to calculate. MD5 is more appropriate if you want to prevent deliberate modification.
Bencode could be the way to go.
Here's an excellent implementation by Daniel Spiewak.
Unfortunately, bencode spec doesn't support utf8 which is a showstopper for me.
Might come to this later but currently xml seems like a better choice (with blobs serialized as a Map).

Categories