I'm trying to convert a mainframe fixed-length file from EBCDIC format to ASCII format. Currently I'm reading the file using the JZOS API (ZFile) and converting it field by field. Is it possible to convert without knowing the layout of the file (aka the COPYBOOK), by just reading the entire bytes of a record or line? If so, how do I handle packed decimals and binary values?
Is it possible to convert without knowing the layout of the file (aka the COPYBOOK), by just reading the entire bytes of a record or line?
No.
Text fields must be converted from EBCDIC to ASCII. Binary and packed decimal fields must not be converted: if you convert binary fields, you will alter their values; it's possible (likely? certain?) that you will destroy them.
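For the text portion, a minimal sketch in Java might look like the following. Note that Cp1047 is just one common EBCDIC code page; the one you actually need depends on the system that wrote the file.

import java.nio.charset.Charset;

// A minimal sketch, assuming the text field's offset and length are already
// known from the record layout. Cp1047 is one common EBCDIC code page; the
// right code page depends on your system.
static String ebcdicTextToString(byte[] record, int offset, int length) {
    return new String(record, offset, length, Charset.forName("Cp1047"));
}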
Binary fields coming from a mainframe will be big-endian; you may need to convert these to little-endian. +10 in a halfword on a mainframe is x'000A', while on a little-endian machine it is x'0A00'.
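In Java this is less of an issue than in C, since ByteBuffer defaults to big-endian; a minimal sketch:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// A minimal sketch: Java's ByteBuffer is big-endian by default, matching the
// mainframe, so a halfword can be read directly.
byte[] halfword = {0x00, 0x0A};                      // +10 as a big-endian halfword
short value = ByteBuffer.wrap(halfword).getShort();  // value == 10
// Only when producing bytes for a little-endian consumer do you reorder:
ByteBuffer le = ByteBuffer.allocate(2).order(ByteOrder.LITTLE_ENDIAN);
le.putShort(value);                                  // buffer now holds 0x0A 0x00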
Packed decimal fields may have implied decimal positions. If your file contains x'12345C' that may represent +123.45 or +12,345. The format of the field tells you how to interpret the data.
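For reference, decoding a packed decimal field is mechanical once you know its length and implied scale from the layout; here is a minimal sketch:

import java.math.BigDecimal;

// A minimal sketch: unpack a COMP-3 field. Each byte holds two decimal digits,
// except the last, whose low nibble is the sign (0xC or 0xF positive, 0xD
// negative). The scale (implied decimal positions) must come from the layout.
static BigDecimal unpack(byte[] field, int scale) {
    StringBuilder digits = new StringBuilder();
    for (int i = 0; i < field.length; i++) {
        int hi = (field[i] >> 4) & 0x0F;
        int lo = field[i] & 0x0F;
        if (i < field.length - 1) {
            digits.append(hi).append(lo);
        } else {
            digits.append(hi);                 // the final low nibble is the sign
            if (lo == 0x0D) digits.insert(0, '-');
        }
    }
    return new BigDecimal(digits.toString()).movePointLeft(scale);
}
// unpack(new byte[] {0x12, 0x34, 0x5C}, 2) yields 123.45; with scale 0, 12345.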
You cannot do the conversion without knowing the record layout including field formats.
In my experience, the best way to avoid these difficulties is to preprocess the file on the mainframe, converting all binary and packed decimal fields to text with embedded explicit signs and decimal points. Then the file can safely go through code page (EBCDIC to ASCII in this case) conversion.
Such preprocessing can easily be done with the mainframe SORT utility, which typically excels at data transformations.
I designed a protocol for sending TCP/IP messages between peers in a peer-to-peer system.
A message is a byte array in which the first byte indicates the wanted operation. Then follow the arguments.
To reconstruct the arguments I read byte by byte. Because it is possible that there are multiple arguments, I have to put tags between them (I call them end-of-argument bytes).
What is the common way for including such tags?
Currently I use one byte to represent the end-of-argument tag (number 17). It is important that I use a byte (or byte sequence) that will never be contained in an argument (otherwise it will be interpreted as an end-of-argument byte).
At first I thought to use number 17 as the end-of-argument byte, as that is the ASCII value for "device control 1" (DC1). But now I'm not 100% sure that it will never be contained in an argument. Arguments are files (any possible file, for example .txt or .doc, but also, say, an image or ...).
You cannot insert separators without making assumptions about the data that will reside between them. If your protocol is to be as generic as possible, then it must support arbitrary byte arrays, which can contain any byte value, including your separator.
I suggest taking the same approach as the typical binary serialization formats out there (e.g. Avro), but since you don't have any kind of schema definition, you will need to adjust it a bit and carry type information inline, as Thrift or Protobuf do, but without the schema.
Try the following format:
[type1][length1][data1][type2][length2][data2]...[typeN][lengthN][dataN]
The type tag can be 4 bits, which gives you 16 types to assign; you could say type 1 is a String, 2 a JPEG image, 3 a long number, and so on, depending on your needs.
The length can be one byte, which lets you indicate lengths from 1 to 256 (store length - 1). If you need a larger length, reserve the maximum value to mean the sequence continues: keep reading chunks of the same type until you find a length below the maximum, which marks the last chunk for this value. A sketch of this framing follows this paragraph.
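As a rough illustration, here is a minimal Java sketch of this kind of framing. For simplicity it uses a whole byte for the type and a fixed 4-byte length rather than the 4-bit tag and continuation scheme described above, and the type values are arbitrary:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

// A minimal sketch of [type][length][data] framing. Because the reader always
// knows how many data bytes follow, the payload may contain any byte value.
static byte[] frame(byte type, byte[] data) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(type);                                                // type tag
    out.write(ByteBuffer.allocate(4).putInt(data.length).array());  // length
    out.write(data);                                                // payload
    return out.toByteArray();
}

static void parse(ByteBuffer message) {
    while (message.hasRemaining()) {
        byte type = message.get();
        int length = message.getInt();
        byte[] data = new byte[length];
        message.get(data);
        // dispatch on type here; no sentinel byte can be misread as data
    }
}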
The advantage of this method is that you always know where the framing bytes are and where the actual data is. So rather than indicating the end of an argument, you indicate its beginning plus its length.
Later, if you can categorize your messages, you can include a schema tag; this would let you strip the type information from the messages and keep only the schema id and the length tags, which can potentially improve performance.
I'm generating a CSV file, and I have a bunch of numbers without decimal points. I'm required to output .00 in those cases, so I'm using:
DecimalFormat f = new DecimalFormat("#.00");
So far so good; I can see a string looking this way:
String myStringWithDecimalPoints = "124.00, 24567868.00, 5.00";
but when I do:
out.write(myStringWithDecimalPoints.getBytes());
I get in my csv:
124, 24567868, 5
Why is this happening?
Any workarounds? (it does have to be CSV and .00 must appear)
You have to be careful with how you view your data, especially in spreadsheets like Excel, where the displayed format depends on the type of the cell: Excel parses 124.00 as a number and may show it without the decimals.
A note for the future, with a call like
out.write(myStringWithDecimalPoints.getBytes());
you can safely assume that Java is writing all the bytes to the OutputStream. If you're not seeing the same thing on the receiving side, then the reading isn't being done the way you would expect.
Most likely there is some confusion between the value of your original string and how that string appears when you view the output in certain ways. This does not necessarily mean that your "original" string has been altered.
We could say with greater certainty if you provided a more complete code example.
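For instance, a minimal sketch like this (file name and values are made up) lets you verify the bytes on disk; open the result in a plain text editor rather than Excel:

import java.io.FileWriter;
import java.text.DecimalFormat;

// A minimal sketch: the formatter does emit the trailing ".00".
// Assumes a locale whose decimal separator is '.'.
public class CsvDemo {
    public static void main(String[] args) throws Exception {
        DecimalFormat f = new DecimalFormat("#.00");
        try (FileWriter out = new FileWriter("demo.csv")) {
            out.write(f.format(124) + "," + f.format(24567868) + "," + f.format(5));
        }
        // demo.csv contains the text: 124.00,24567868.00,5.00
        // Excel may still display 124, 24567868, 5 after parsing them as numbers.
    }
}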
I am manually serializing data objects to a file, using a ByteBuffer and its operations such as putInt(), putDouble(), etc.
One of the fields I'd like to write-out is a String. For the sake of example, let's say this contains a currency. Each currency has a three-letter ISO currency code, e.g. GBP for British Pounds Sterling.
Assuming each object I'm serializing has just a double and a currency, you could consider the serialized data to look something like:
100.00|GBP
200.00|USD
300.00|EUR
Obviously, in reality I'm not delimiting the data (neither the pipe between fields nor the line feeds); it's stored in binary. I'm just using the above as an illustration.
Encoding the currency with each entry is a bit inefficient, as I keep storing the same three characters. Instead, I'd like to have a header which stores a mapping for currencies. The file would look something like:
100
GBP
USD
EUR
~~~
~~~
100.00|1
200.00|2
300.00|3
The first two bytes of the file are a short holding the decimal value 100. This tells me there are 100 slots for currencies in the file. Following this are 3-byte chunks which are the currencies, in order (ASCII-only characters).
When I read the file back in, all I have to do is build up a 100-element array of the currency codes, and I can cheaply and efficiently look up the relevant currency for each record.
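For illustration, reading that header back might look like this minimal sketch (the ByteBuffer is assumed to be positioned at the start of the file; unused slots would hold padding):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// A minimal sketch: read the slot count, then the 3-byte ASCII codes in order.
static String[] readCurrencyTable(ByteBuffer file) {
    short slots = file.getShort();       // e.g. 100
    String[] table = new String[slots];
    byte[] code = new byte[3];
    for (int i = 0; i < slots; i++) {
        file.get(code);
        table[i] = new String(code, StandardCharsets.US_ASCII);
    }
    return table;
}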
Reading the file back in seems simple, but I'm interested to hear thoughts on writing out the data.
I don't know all the currencies up front, and I'm actually supporting any three-character code, even if it's invalid. Thus I have to build up the table converting currencies to indexes on the fly.
I intend to use a SeekableByteChannel to address my file, seeking back to the header every time I find a new currency I've not indexed before.
This has the obvious I/O overhead of moving around the file. But I expect to see all the different currencies within the first few data objects written, so it'll probably only seek during the first few seconds of execution and then not need another seek for hours.
The alternative is to wait for the stream of data to finish and then write the header once. However, if my application crashes before I've written the header, the data in the file cannot be recovered.
Seeking seems like the right thing to do, but I've not attempted it before, and I was hoping to hear the horror stories up front rather than through trial and error on my end.
The problem with your approach is that you say you do not want to limit the number of currency codes, which implies you don't know how much space to reserve for the header. Seeking in a plain local file might be cheap if not performed too often, but shifting the entire file contents to reserve more room for the header is expensive.
The other question is how you define efficiency. If you don't limit the number of currency codes, you have to allow for the case where a single byte is not sufficient for your index, so you need either a dynamic, possibly multi-byte encoding, which is more complicated to parse, or a fixed multi-byte encoding, which ends up taking the same number of bytes as the currency code itself.
So if decoding efficiency matters more to you than space efficiency in the typical case, you can use the fact that these codes are all made up of ASCII characters only. You can encode each currency code in three bytes, and if you accept one padding byte, you can use a single putInt/getInt for storing/retrieving a currency code without the need for any header lookup.
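A minimal sketch of that padded-int idea (the method names are made up):

import java.nio.charset.StandardCharsets;

// A minimal sketch: pack a 3-character ASCII code plus one padding byte into
// an int, so each record needs only one putInt/getInt and no header lookup.
static int packCurrency(String code) {
    byte[] a = code.getBytes(StandardCharsets.US_ASCII);
    return ((a[0] & 0xFF) << 24) | ((a[1] & 0xFF) << 16) | ((a[2] & 0xFF) << 8);
}

static String unpackCurrency(int packed) {
    byte[] a = { (byte) (packed >>> 24), (byte) (packed >>> 16), (byte) (packed >>> 8) };
    return new String(a, StandardCharsets.US_ASCII);
}
// Writing a record: buffer.putDouble(100.00); buffer.putInt(packCurrency("GBP"));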
I don’t believe that optimizing these codes further would improve you storage significantly. The table does not consist of currency codes only. It’s very likely the other data will take much more space.
I would like to know if there is any difference between Java byte code and .NET byte code. If there is a difference, should I take the hexadecimal values of the Java bytes and the .NET bytes? Hexadecimal is independent of language; it is a universal representation.
Problem description
We are developing a mobile application in J2ME and Java. I am using an external fingerprint reader for reading/verifying fingerprints, and we are using a Java API to drive it.
I capture the fingerprint template and raw image bytes. I convert the raw image bytes into hex form and store them in a separate text file.
We use a conversion tool (developed in .NET) that converts the hex form into an image. With the help of that tool we are trying to get the image from that text file, but we cannot get the image correctly.
The .NET programmer says the Java byte and the .NET byte differ: a Java byte ranges from -128 to 127, but a .NET byte ranges from 0 to 255. So there is a problem.
But my assumption here is that hex is independent of Java and .NET; it is common to both. So, instead of storing the raw bytes in the text file, I plan to convert them into hexadecimal format, so that our .NET conversion tool can convert the hexadecimal into an image.
I don't know whether I am on the correct path or not.
Hexadecimal is just a way to represent numbers.
Java is compiled to bytecode and executed by a JVM.
.NET is compiled to bytecode and executed by the CLR.
The two formats are completely incompatible.
I capture the fingerprint template and raw image bytes. I convert the raw image bytes into hex form and store them in a separate text file.
OK; note that storing as binary would have been easier (and more efficient), but that approach should work.
We use a conversion tool (developed in .NET) that converts the hex form into an image. With the help of that tool we are trying to get the image from that text file, but we cannot get the image correctly.
Rather than worrying about the image, the first thing to do is check where the problem is; there are two obvious scenarios:
you aren't reading the data back into the same bytes
you have the right bytes, but you can't get them to load as an image
First, figure out which of those it is, simply by storing some known data and attempting to read it back at the other end.
The .NET programmer says the Java byte and the .NET byte differ: a Java byte ranges from -128 to 127, but a .NET byte ranges from 0 to 255. So there is a problem.
That shouldn't be a problem for any well-written hex encoder. I would expect a single Java byte to be written correctly as a single hex value between 00 and FF.
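For example, a minimal sketch of such an encoder in Java masks the signed byte, so both platforms see the same two characters for each byte:

// A minimal sketch: b & 0xFF maps Java's signed byte (-128..127) onto 0..255,
// so -1 encodes as "FF" exactly as an unsigned .NET byte of 255 would.
static String toHex(byte[] data) {
    StringBuilder sb = new StringBuilder(data.length * 2);
    for (byte b : data) {
        sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
}

static byte[] fromHex(String hex) {
    byte[] out = new byte[hex.length() / 2];
    for (int i = 0; i < out.length; i++) {
        out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return out;
}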
I don't know whether I am on the correct path or not.
Personally, I suspect you are misunderstanding the problem, which makes it likely that the solution is off the mark. If you want to make life easier, store the data as binary rather than text; but there is no inherent problem with exchanging hex. If I had to pack raw binary data into a text file, I'd probably go for base64 rather than hex (it will be shorter), but either is fine.
As I mentioned above: first figure out whether the problem is in reading the bytes or in processing the bytes into an image. I'm also assuming that the bytes are an image format that both environments can process, and not (for example) a custom serialization format.
Yes, Java byte code and .NET’s byte code are two different things that are not interchangeable. As to the second part of your question, I have no idea what you are talking about.
Yes, they are different, although there are tools that can migrate from one to the other. Search Google for "java bytecode IL comparison" for examples.
I read the following bit from Oracle:
Can I execute methods on compressed versions of my objects, for example isempty(zip(serial(x)))?
This is not really viable for arbitrary objects because of the encoding of objects. For a particular object (such as String) you can compare the resulting bit streams. The encoding is stable, in that every time the same object is encoded it is encoded to the same set of bits.
So I got this idea: say I have a char array some 4M long. Is it possible for me to compress it to a few hundred bytes using GZIPOutputStream, then map the whole file into memory and do random searches on it by comparing bits? Say I am looking for the char sequence "abcd": could I somehow get the bit sequence of the compressed version of "abcd" and then just search the file for it? Thanks.
You cannot use GZIP or similar to do this, as the encoding of each byte changes as the stream is processed; i.e. the only way to determine what a byte means is to read all the bytes before it.
If you want to access the data randomly, you can break the String into smaller sections and compress each section independently. That way you only need to decompress a relatively short section of data.
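A minimal sketch of that block idea (the block size and the array-of-arrays structure are arbitrary choices):

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// A minimal sketch: compress fixed-size sections independently, so any one
// section can be inflated later without reading the bytes before it.
static byte[][] compressInBlocks(byte[] data, int blockSize) {
    int blocks = (data.length + blockSize - 1) / blockSize;
    byte[][] out = new byte[blocks][];
    Deflater deflater = new Deflater();
    byte[] tmp = new byte[8192];
    for (int i = 0; i < blocks; i++) {
        int off = i * blockSize;
        int len = Math.min(blockSize, data.length - off);
        deflater.reset();
        deflater.setInput(data, off, len);
        deflater.finish();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        while (!deflater.finished()) {
            buf.write(tmp, 0, deflater.deflate(tmp));
        }
        out[i] = buf.toByteArray();
    }
    return out;
}
// To search for "abcd", inflate only the candidate blocks; note a match can
// still straddle a block boundary, which a real implementation must handle.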