I understand the structure of a java .class file, but when I want to interpret the raw hex data I get a bit lost.
This is a hex dump of a class file, excluding the header and constant pool.
I understand the header to be the magic number, minor_version and major_version. It seems the next value should be the access flags.
Which value would that be in this chart? 000000b0? I thought it would be a simple number not a hex value.
Which value is this_class, the index into the constant pool where the class details can be determined?
The 000000b0 is not part of the data. It's the memory address where the following 16 bytes are located.
The two-digit hex numbers are the actual data. Read them from left to right. Each row is split into two groups of eight purely to assist in working out memory addresses etc.
So to answer your question indirectly, you can work out where the access flags are by simply counting past the number of bytes used by the magic number, minor version and major version. The access flags will come next. Likewise, to find any other values (such as this_class), you have to work out what their offset is and look at that location in the data.
You say that you expected a "simple number not a hex value", but that doesn't really make sense, as hex values are simple numbers. They're simply represented in base-16 instead of base-10. There are plenty of resources online that will teach you how to convert between the two.
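If it helps to see those offsets in code, here is a minimal sketch (the file name Example.class is just an illustration) that reads the fixed-size fields at the start of a class file with DataInputStream and prints the same values in hex and in decimal:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class ReadClassHeader {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream("Example.class"))) {
            int magic = in.readInt();            // u4: should be 0xCAFEBABE
            int minor = in.readUnsignedShort();  // u2: minor_version
            int major = in.readUnsignedShort();  // u2: major_version
            System.out.printf("magic=%08x minor=%d (0x%x) major=%d (0x%x)%n",
                    magic, minor, minor, major, major);
        }
    }
}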
For example, if a file is 100 bits, it would be stored as 13 bytes. This means that the first 4 bits of the last byte belong to the file and the last 4 bits do not (useless data).
So how is this prevented when reading a file using the FileInputStream.read() function in java or similar functions in other programming language?
You'll notice if you ever use assembly that there's no way to actually read a specific bit. The smallest addressable unit of memory is a byte; memory addresses refer to a specific byte in memory. If you ever need a specific bit, you have to get at it with bitwise operators like |, & and ^. So in this situation, if you store 100 bits in binary, you're actually storing a minimum of 13 bytes, and the few leftover bits just default to 0, so the results are the same.
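To make that concrete, here is a tiny sketch (the helper name is my own) of how a single bit has to be dug out of a byte with bitwise operators:

// Reading bit k (0 = least significant) of a byte: load the whole byte,
// shift the wanted bit to the bottom, and mask everything else away.
static boolean bitIsSet(byte b, int k) {
    return ((b >> k) & 1) == 1;
}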
Current file systems mostly store files that are an integral number of bytes, so the issue does not arise. You cannot write a file that is exactly 100 bits long. The reason for this is simple: the file metadata holds the length in bytes and not the length in bits.
This is a conscious design choice by the designers of the file system. They presumably chose the design the way they do out of a consideration that there's very little need for files that are an arbitrary number of bits long.
Those cases that do need a file to contain a non-integral number of bytes can (and need to) make their own arrangements. Perhaps the 100-bit case could insert a header that says, in effect, that only the first 100 bits of the following 13 bytes have useful data. This would of course need special handling, either in the application or in some library that handled that sort of file data.
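A minimal sketch of that header idea, assuming a made-up file name and a plain int header holding the bit count:

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class WriteBitLengthHeader {
    public static void main(String[] args) throws IOException {
        int bitLength = 100;
        byte[] payload = new byte[(bitLength + 7) / 8]; // 13 bytes; the last 4 bits are padding
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("bits.dat"))) {
            out.writeInt(bitLength); // header: how many of the following bits are meaningful
            out.write(payload);
        }
    }
}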
Comments about bit-lengthed files not being possible because of the size of a boolean, etc., seem to me to miss the point. Certainly disk storage granularity is not the issue: we can store a "100 byte" file on a device that can only handle units of 256 bytes - all it takes is for the file system to note that the file size is 100, not 256, even though 256 bytes are allocated to the file. It could equally well track that the size was 100 bits, if that were useful. And, of course, we'd need I/O syscalls that expressed the transfer length in bits. But that's not hard. The in-memory buffer would need to be slightly larger, because neither the language nor the OS allocates RAM in arbitrary bit-lengths, but that's not tied tightly to file size.
I am trying to make a function in Java to splice one binary input into another at the bit level, as opposed to the byte level. For example, splicing "00000000" into "11111111" at position 3 would produce "1110000000011111".
I have tried looking into the JBBP library for dealing with binary data, but it is almost as though there is a special coding language, other than Java, specific to this library, which you write and then pass as a string to the JBBPParser.prepare() function. I tried looking in the Javadocs for the library, but they only describe what the function does, not what the commands you can pass as a string are, what those commands do, or what the proper syntax for those commands is.
Can anyone provide a link to the documentation for the commands you can pass as strings to the JBBPParser functions or provide an alternative way to splice binary data together at an arbitrary bit without relying on binary strings and parseInt as they are inefficient?
Read about Endianness. Java is big-endian. This means that your example is wrong. Instead of inserting the zero string (00000000) after position 3 in the target byte, your example demonstrates inserting the zero string either after position 6 or before position 5 (see bit numbering below).
The operation you are looking for is called "left shift". Here is a link discussing Bitwise and Bit Shift Operators.
Here is the strategy you want to use (a short sketch follows after the bit-numbering note below):
Determine how many bits to shift from the target into the destination. I will call this the initialShiftCount. In your example, the initialShiftCount is 3 bits.
Left shift the initialShiftCount number of bits into the destination.
Left shift all 8 bits from the insert string into the destination.
Left shift the remaining bits from the target into the destination.
Terms:
*destination* - the location into which you want to place the combination of the bits in the insert string and the target. In your example, this is the variable that ends up holding the string `1110000000011111`.
*insert string* - the bits you want to insert into the target bits. In your example, the insert string is `00000000`.
*target* - the bits into which you want to insert the insert string. In your example, the target is `11111111`.
Bit Numbering
In a big-endian system, the bits are numbered as follows: 87654321
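Putting the shift steps together, here is a hedged sketch (the method name and the int-based representation are my own choices, not something from JBBP):

// Splice an 8-bit insert value into an 8-bit target after the first
// initialShiftCount bits of the target, producing a 16-bit result.
static int splice(int target, int insert, int initialShiftCount) {
    int result = target >>> (8 - initialShiftCount);                    // first bits of the target
    result = (result << 8) | (insert & 0xFF);                           // all 8 bits of the insert string
    int remaining = 8 - initialShiftCount;
    result = (result << remaining) | (target & ((1 << remaining) - 1)); // remaining bits of the target
    return result & 0xFFFF;
}

// splice(0b11111111, 0b00000000, 3) == 0b1110000000011111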
Introduction
We store tuples (string,int) in a binary file. The string represents a word (no spaces nor numbers). In order to find a word, we apply binary search algorithm, since we know that all the tuples are sorted with respect to the word.
In order to store this, we use writeUTF for the string and writeInt for the integer. Other than that, let's assume for now there are no ways to distinguish between the start and the end of the tuple unless we know them in advance.
Problem
When we apply binary search, we get a position (i.e. (a+b)/2) in the file, which we can read using methods in Random Access File, i.e. we can read the byte at that place. However, since we can be in the middle of the word, we cannot know where this words starts or finishes.
Solution
Here are two possible solutions we came up with; however, we're trying to decide which one will be more space-efficient/faster.
Method 1: Instead of storing the integer as a number, we thought to store it as a string (using e.g. writeChars or writeUTF), because in that case, we can insert a null character at the end of the tuple. That is, we can be sure that none of the methods used to serialize the data will use the null character, since the information we store (letters and digits) has ASCII values above zero.
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file). In this case, we assume that words have a low entropy, so it's very unlikely they will have any signs of randomness. Even if the integer may get 4 bytes that are exactly the same as those in the random noise, the additional two bytes that follow will not (with high probability).
Which of these methods would you recommend? Is there a better way to store this kind of information. Note, we cannot serialize the entire file and later de-serialize it into memory, since it's very big (and we are not allowed to).
I assume you're trying to optimize for speed & space (in that order).
I'd use a different layout, built from 2 files:
Integer + Index file
Each "record" is exactly 8 bytes long, the lower 4 are the integer value for the record, and the upper 4 bytes are an integer representing the offset for the record in the other file (the characters file).
Characters file
Contiguous file of characters (UTF-8 encoding or anything you choose). "Records" are not separated, not terminated in any way, simple 1 by 1 characters. For example, the records Good, Hello, Morning will look like GoodHelloMorning.
To iterate the dataset, you iterate the integer/index file with direct access (recordNum * 8 is the byte offset of the record), read the integer and the characters offset, plus the character offset of the next record (which is the 4 byte integer at recordNum * 8 + 12), then read the string from the characters file between the offsets you read from the index file. Done!
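A hedged sketch of reading one record with that layout (the file handles, names and the UTF-8 choice are illustrative), using RandomAccessFile:

// index: 8-byte records = 4-byte int value + 4-byte offset into the characters file.
static String readWord(RandomAccessFile index, RandomAccessFile chars, int recordNum)
        throws IOException {
    index.seek(recordNum * 8L);
    int value = index.readInt();             // the integer stored with this word
    int start = index.readInt();             // where this word begins in the characters file
    int end;
    if ((recordNum + 1) * 8L < index.length()) {
        index.seek(recordNum * 8L + 12);     // offset field of the next record
        end = index.readInt();
    } else {
        end = (int) chars.length();          // last record runs to the end of the file
    }
    byte[] buf = new byte[end - start];
    chars.seek(start);
    chars.readFully(buf);
    return new String(buf, StandardCharsets.UTF_8);
}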
it's less than 200MB. Max 20 chars for a word.
So why bother? Unless you work on some severely restricted system, load everything into a Map<String, Integer> and get a few orders of magnitude speed up.
But let's say, I'm overlooking something and let's continue.
Method 1: Instead of storing the integer as a number, we thought to store it as a string (using eg. writeChars or writeUTF), because in that case, we can insert a null character
You don't have to, as you said that your words contain no digits. So you can always parse things like 0124some456word789 uniquely.
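A quick sketch of that parsing claim (uses java.util.regex; the sample string is the one above):

// Since words contain no digits, runs of digits and runs of letters can be split apart unambiguously.
Matcher m = Pattern.compile("\\d+|[A-Za-z]+").matcher("0124some456word789");
while (m.find()) {
    System.out.println(m.group()); // prints 0124, some, 456, word, 789 in turn
}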
The efficiency depends on the distribution. You may win a factor of 4 (single digit numbers) or lose a factor of 2.5 (10-digit numbers). You could save something by using a higher base. But there's the storage for the string and it may dominate.
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file).
This is too wasteful. Using four zero bytes between the records would do:
Find a sequence of at least four zeros.
Find the last zero.
That's the last separator byte.
Method 3: Using some hacks, you could ensure that the number contains no zero byte (either assuming that it doesn't use the whole range or representing it with five bytes). Then a single zero byte would do.
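A hedged sketch of Method 3, representing the int as five bytes of seven payload bits each so that no encoded byte can be zero (the helper names are mine):

// Each byte carries 7 data bits with the high bit forced to 1, so 0x00 never
// appears in the encoding and a single zero byte can terminate the record.
static byte[] encodeNonZero(int value) {
    byte[] out = new byte[5];
    for (int i = 4; i >= 0; i--) {
        out[i] = (byte) ((value & 0x7F) | 0x80);
        value >>>= 7;
    }
    return out;
}

static int decodeNonZero(byte[] in) {
    int value = 0;
    for (byte b : in) {
        value = (value << 7) | (b & 0x7F);
    }
    return value;
}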
Method 4: As disk is organized in blocks, you should probably split your data into 4 KiB blocks. Then you can add a small header allowing quick access to the data (start indexes for the 8th, 16th, etc. piece of data). The range between, e.g., the 8th and 16th entries should then be scanned sequentially, as that's both simpler and faster than binary search.
Alright, so we need to store a list of words and their respective position in a much bigger text. We've been asked if it's more efficient to save the position represented as text or represented as bits (data streams in Java).
I think that a binary representation is best, since the text "1024" takes up 4*8=32 bits, while the number itself needs only 11 bits.
The follow up question is should the index be saved in one or two files. Here I thought "perhaps you can't combine text and bitwise-representation in one file?" and that's the reason you'd need two files?
So the question, first and foremost, is: can I store text information (the word) combined with binary information (its position) in one file?
Too vague in terms of what's really needed.
If you have up to a few million words + positions, don't even bother thinking about it. Store in whatever format is the simplest to implement; space would only be an issue if you need to send the data over a low-bandwidth network.
Then there is general-purpose data compression available: by just wrapping your Input/OutputStreams with deflate or gzip (already built into the JRE) you will get reasonably good compression (50% or more for text). That easily beats what you can quickly write yourself. If you need better compression there is XZ for Java (implements LZMA compression), open source.
If you need random access, you're on the wrong track, you will want to carefully design the data layout for the access patterns and storage should be only of tertiary concern.
The number 1024 would take at least 2-4 bytes (so 16-32 bits), as you need to know where the number starts and ends, and so it must have a fixed size. If your positions are very big, like 124058936, you would need to use 4 bytes per number (which would still be better than the 9 bytes of its string representation).
Using binary files you'll also need a way to know where the string starts and ends. You can do this by storing a byte before it with its length, and reading the string like this:
byte[] arr = new byte[in.readByte()]; // in.readByte()*2 if the string is encoded in 16 bits
in.readFully(arr); // in is a DataInputStream / RandomAccessFile; readFully guarantees the buffer is filled
String yourString = new String(arr, "US-ASCII");
The other possibility would be terminating your string with a null character (00), but you would need to create your own implementation for that, as no readers support it by default (AFAIK).
Now, is it really worth storing it as binary data? That really depends on how big your positions are (because the strings, if in the text version are separated from their position with a space, would take the same amount of bytes).
My recommendation is that you use the text version, as it will probably be easier to parse and more readable.
About using one or two files, it doesn't really matter. You can combine text and binary in the same file, and it would take the same space (though making it in two separated files will always take a bit more space, and it might make it more messy to edit).
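To illustrate that text and binary can share one file, here is a minimal sketch with DataOutputStream/DataInputStream (the file name is made up): writeUTF emits a 2-byte length prefix followed by the encoded characters, and writeInt emits a fixed 4-byte big-endian value.

try (DataOutputStream out = new DataOutputStream(new FileOutputStream("index.dat"))) {
    out.writeUTF("someWord"); // the word, as length-prefixed modified UTF-8 (text)
    out.writeInt(1024);       // its position, as 4 raw bytes (binary)
}

try (DataInputStream in = new DataInputStream(new FileInputStream("index.dat"))) {
    String word = in.readUTF();   // reads the length prefix, then the characters
    int position = in.readInt();  // reads the 4-byte position
}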
I'm trying to compute the size of an item in dynamoDB and I'm not able to understand the definition.
The definition I found : An item size is the sum of lengths of its attribute names and values (binary and UTF-8 lengths). So it helps if you keep attribute names short.
Does it mean that if I put a number in the database, for example 1, it'll take the size of an int? A long? A double? Will it take the same amount of space as 100 or 1000000, or will it take only the size of the corresponding binary representation?
And how is the size computed for a String?
Is there someone that knows how to compute it?
That's a non-trivial topic indeed - you already quoted the somewhat sloppy definition from the Amazon DynamoDB Data Model:
An item size is the sum of lengths of its attribute names and values
(binary and UTF-8 lengths).
This is detailed a bit further down the page, within Amazon DynamoDB Data Types:
String - Strings are Unicode with UTF8 binary encoding.
Number - Numbers are positive or negative exact-value decimals and integers. A number can have up to 38 digits of precision after the decimal point, and can be between 10^-128 to 10^+126. The representation in Amazon DynamoDB is of variable length. Leading and trailing zeroes are trimmed.
A similar question to yours has been asked in the Amazon DynamoDB forum as well (see Curious nature of the "Number" type), and the answer from Stefano#AWS sheds more light on the issue:
The "Number" type has 38 digits of precision These are actual decimal
digits. So it can represent pretty large numbers, and there is no
precision loss.
How much space does a Number value take up? Not too
much. Our internal representation is variable length, so the size is
correlated to the actual (vs. maximum) number of digits in the value.
Leading and trailing zeroes are trimmed btw. [emphasis mine]
Christopher Smith's follow-up post presents more insights into the resulting ramifications regarding storage consumption and its calculation; he concludes:
The existing API provides very little insight in to storage
consumption, even though that is part (admittedly not that
significant) of the billing. The only information is the aggregate
table size, and even that data is potentially hours out of sync.
While Amazon does not expose its billing data via an API yet, they'll hopefully add an option to retrieve some information regarding item size to the DynamoDB API at some point, as suggested by Christopher.
I found this answer in the Amazon developer forum, answered by Clarence#AWS:
eg:-
"Item":{
"time":{"N":"300"},
"feeling":{"S":"not surprised"},
"user":{"S":"Riley"}
}
in order to calculate the size of the above object:
The item size is the sum of lengths of the attribute names and values,
interpreted as UTF-8 characters. In the example, the number of bytes of
the item is therefore the sum of
time: 4 + 3
feeling: 7 + 13
user: 4 + 5
Which is 36
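A quick sketch of that arithmetic in plain Java (string attributes only; uses java.util.Map and java.nio.charset.StandardCharsets): sum the UTF-8 byte lengths of every attribute name and value.

Map<String, String> item = Map.of(
        "time", "300",
        "feeling", "not surprised",
        "user", "Riley");

int size = 0;
for (Map.Entry<String, String> e : item.entrySet()) {
    size += e.getKey().getBytes(StandardCharsets.UTF_8).length;   // attribute name
    size += e.getValue().getBytes(StandardCharsets.UTF_8).length; // attribute value
}
System.out.println(size); // 36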
For the formal definition, refer to:
http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/WorkingWithDDItems.html
An item’s size is the sum of all its attributes’ sizes, including the hash and range key attributes.
Attributes themselves have a name and a value. Both the name and value contribute to an attribute’s size.
Names are sized the same way as string values. All values are sized differently based on their data type.
If you're interested in the nitty-gritty details, have a read of this blog post.
Otherwise, I've also created a DynamoDB Item Size and Consumed Capacity Calculator that accurately determines item sizes.
Numbers are easily DynamoDB's most complicated type. AWS does not publicly document how to determine how many bytes are in a number. They say this is so they can change the internal implementation without anyone being tied to it. What they do say, however, sounds simple but is more complicated in practice.
Very roughly, though, the formula is something like 1 byte for every 2 significant digits, plus 1 extra byte for positive numbers or 2 for negative numbers. Therefore, 27 is 2 bytes and -27 is 3 bytes. DynamoDB will round up if there’s an uneven amount of digits, so 461 will use 3 bytes (including the extra byte). Leading and trailing zeros are trimmed before calculating the size.
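A rough sketch of that approximation (this is my own reading of the rule above, not an official AWS formula):

// 1 byte per 2 significant digits (rounded up), plus 1 byte for positive numbers
// or 2 for negative ones; sign, decimal point and leading/trailing zeroes are ignored.
static int approxNumberSize(String number) {
    boolean negative = number.startsWith("-");
    String digits = number.replaceAll("[-+.]", "")
                          .replaceAll("^0+", "")
                          .replaceAll("0+$", "");
    int significant = Math.max(digits.length(), 1);
    return (significant + 1) / 2 + (negative ? 2 : 1);
}

// approxNumberSize("27")  == 2
// approxNumberSize("-27") == 3
// approxNumberSize("461") == 3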
You can use the algorithm for computing DynamoDB item size in the DynamoDB Storage Backend for Titan DynamoDBDelegate class.
All the above answers skip the issue of storing the length of each attribute value, as well as the length of each attribute name and the type of each attribute.
The DynamoDB Naming Guide says names can be 1 to 255 characters long, which implies a 1-byte name-length overhead.
We can work back from the 400 KB maximum item limit to know there's an upper limit on the length required for binary or string items - they don't need to store more than a 19-bit number for the length.
Using a bit of adaptive coding, I would expect:
Numbers have a 1 byte leading type and length value but could also be coded into a single byte (eg: a special code for a Zero value number, with no value bytes following)
String and binary have 1-3 bytes leading type and length
Null is just a type byte without a value
Bool is a pair of type bytes without any other value
Collection types have 1-3 bytes leading type and length.
Oh, and DynamoDB is not schemaless. It is schema-per-item because it's storing the types, names and lengths of all these variable length items.
An approximation of how much space an item occupies in your DynamoDB table is to issue a get request with the boto3 library.
This is not an exact way to determine the size of an item, but it will help you get an idea. When performing a batch_get_item(**kwargs) (with ReturnConsumedCapacity requested), you get a response that includes the ConsumedCapacity in the following form:
....
'ConsumedCapacity': [
{
'TableName': 'string',
'CapacityUnits': 123.0,
'ReadCapacityUnits': 123.0,
'WriteCapacityUnits': 123.0,
'Table': {
'ReadCapacityUnits': 123.0,
'WriteCapacityUnits': 123.0,
'CapacityUnits': 123.0
},
'LocalSecondaryIndexes': {
'string': {
'ReadCapacityUnits': 123.0,
'WriteCapacityUnits': 123.0,
'CapacityUnits': 123.0
}
},
'GlobalSecondaryIndexes': {
'string': {
'ReadCapacityUnits': 123.0,
'WriteCapacityUnits': 123.0,
'CapacityUnits': 123.0
}
}
},
]
...
From there you can see how many capacity units it took, and you can derive an approximate size for the item. Obviously this depends on your configuration of the system, due to the fact that:
One read request unit represents one strongly consistent read request, or two eventually consistent read requests, for an item up to 4 KB in size. Transactional read requests require 2 read request units to perform one read for items up to 4 KB. If you need to read an item that is larger than 4 KB, DynamoDB needs additional read request units. The total number of read request units required depends on the item size, and whether you want an eventually consistent or strongly consistent read.
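As a tiny illustration of that rounding (my own helper, not an AWS API):

// One read request unit covers a strongly consistent read of an item up to 4 KB;
// an eventually consistent read costs half as much.
static double readRequestUnits(int itemSizeBytes, boolean stronglyConsistent) {
    int blocks = (itemSizeBytes + 4095) / 4096; // round the item size up to 4 KB units
    return stronglyConsistent ? blocks : blocks / 2.0;
}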
The simplest approach would be to create an item in the table and export it to a CSV file, which is an option available in DynamoDB. The size of the CSV file will give you the item size approximately.