I am having some trouble encoding this string into Code 128 barcode symbology.
Text to encode:
1021448642241082212700794828592311
I am using the universal encoder from idautomation.com:
https://www.bcgen.com/fontencoder/
I get the following output for the encoded text for Code 128:
Í*5LvJ8*r5;ÂoP<[7+.Î
However, in ";Âo", the character between the semicolon and the o (let us call it "special A") is not part of the extended character set used in Code 128. (See the Latin Supplements at https://www.fonts2u.com/code-128.font.)
Yet the same string shows a valid barcode at
https://www.bcgen.com/linear-barcode-creator.html
How?
If I use the output containing the special A on a webpage with a barcode font face, the special A character does not render as part of the barcode (which seems correct, since the special A is not part of the character set).
What gives? Please help.
I am using the IDAutomation utility to encode the string to Code 128 C symbology. If you can share code to do the encoding (in Java/Python/C/Perl), that would help too.
There are multiple fonts for Code128 that may use different characters to represent the barcode symbols. Make sure the font and the encoding logic match each other.
I used this one: http://www.jtbarton.com/Barcodes/Code128.aspx (the site also has sample code showing how to encode it, but you have to translate it from VB). The font works for all three encodings (A, B and C).
Sorry, this is very late.
When you are dealing with Code 128 encoding, in any subset, it's a good idea to think of that encoding in terms of numbers, not characters. At this level, where shifts, code changes, checksums and so on are intermixed with the data, the whole concept of a "character" is lost.
However, this is what is happening:
The semicolon in the output corresponds to "27"
The lowercase o corresponds to "79" and the P to "48"
The "A with Macron" corresponds to your "00" sequence. This is why you should be dealing with numbers, not characters, at this level of encoding.
How would you expect it to show a character with a code of 00? That would be a space or a NUL, neither of which is particularly visible.
Your software has simply rendered it the best way it can, which is to keep the character 'visible': IDAutomation's Code 128 fonts map value 00 to character 194 (Â) rather than to a plain space, so that spaces cannot be trimmed away. That is the "special A" you are seeing.
The rest (indeed all) of your encoded string looks correct for a set C encodation.
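Since you also asked for code: below is a minimal Java sketch of the font-independent part of subset C encoding, i.e. computing the symbol values (Start C, one value per digit pair, the mod-103 check symbol, Stop). The mapping of these values to font glyphs is deliberately left out, since as noted it differs from font to font; the method name is just for illustration.

// Code 128 C: each pair of digits becomes one symbol value (00-99).
// Start C = 105, Stop = 106; check symbol = position-weighted sum mod 103.
static int[] code128cValues(String digits) {
    if (digits.length() % 2 != 0) {
        throw new IllegalArgumentException("subset C needs an even number of digits");
    }
    int pairs = digits.length() / 2;
    int[] values = new int[pairs + 3];   // Start C + data + check + Stop
    values[0] = 105;                     // Start C
    for (int i = 0; i < pairs; i++) {
        values[i + 1] = Integer.parseInt(digits.substring(2 * i, 2 * i + 2));
    }
    int sum = values[0];
    for (int i = 1; i <= pairs; i++) {
        sum += values[i] * i;            // weight = position in the data
    }
    values[pairs + 1] = sum % 103;       // check symbol
    values[pairs + 2] = 106;             // Stop
    return values;
}

For your input this yields the data values 10 21 44 86 42 24 10 82 21 27 00 79 48 28 59 23 11 and a check symbol of 14, which is consistent with the ';', 'Â', 'o', 'P' and trailing '.' in the encoder output above (for fonts that render values 0-94 as ASCII code value+32).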
I have been working on a scenario that does the following:
Get input data in Unicode format; [UTF-8]
Convert to ISO-8559;
Detect & replace unsupported characters for encoding; [Based on user-defined key-value pairs]
My question is, I have been trying to find in-depth information on ISO-8559 with no luck so far. Does anybody happen to know more about it? How different is it from ISO-8859? Any details would be very helpful.
Secondly, keeping the ISO-8559 requirement aside, I went ahead and wrote my program to convert the incoming data to ISO-8859 in Java. While I am able to achieve what is needed using character-based replacement, it is obviously time-consuming when the data size is huge. [in MBs]
I am sure there must be a better way to do this. Can someone advise me, please?
I assume you want to convert UTF-8 to ISO-8859-1, that is, Western Latin-1. There are many charset tables on the net.
In general, for web browsers and Windows, it would be better to convert to Windows-1252, an extension that redefines the range 0x80 - 0x9F, among other things with the special quotes seen in MS Word. Browsers are de facto capable of interpreting these codes even in a document declared as ISO-8859-1, even on a Mac.
Java's standard conversion, like new OutputStreamWriter(new FileOutputStream("..."), "Windows-1252"), already does much. You can either write a kind of filter, or find the '?' characters introduced for untranslatable special characters. You could transliterate Latin letters with accents not in Windows-1252 as plain ASCII letters:
import java.text.Normalizer;

String s = ...;
// Decompose into base letters + combining marks (NFD), then strip the marks.
s = Normalizer.normalize(s, Normalizer.Form.NFD);
return s.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
For other scripts like Hindi or Cyrillic the keyword to search for is transliteration.
I want to output Arabic and English text at the same time in Java, for example outputting the following statement: مرحبا I am Adham.
I searched the internet and found that the BiDi algorithm is needed in this case. Are there any Java classes for BiDi?
I have tried the class BiDiReferenceJava and tested it, but when I call runSample() in the class BidiReferenceTest and pass an Arabic string as the parameter, I get an index-out-of-bounds exception because the character count is doubled (exactly at this line of code in the class BidiReferenceTestCharmap):
byte[] result = new byte[count];
Where, if the string length is 4, the count is 8!
The ICU4J is more or less the standard comprehensive Unicode library for Java, and thus supports the bidirectional algorithm. I really wonder why you need this, though; BiDi is usually applied by the display layer, unless you're a word-processor or something.
BidiReference.java is apparently a demonstration piece; it's designed to show how the algorithm works on ASCII characters instead of using actual Unicode characters.
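If you only need the run structure rather than a reference implementation, the JDK's built-in java.text.Bidi class (available since Java 1.4) may already be enough. A minimal sketch; the sample string is just an illustration:

import java.text.Bidi;

String text = "\u0645\u0631\u062D\u0628\u0627 I am Adham"; // "مرحبا I am Adham"
Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
// Each run is a maximal stretch of text with a single direction;
// an odd embedding level means right-to-left.
for (int i = 0; i < bidi.getRunCount(); i++) {
    String run = text.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
    boolean rtl = (bidi.getRunLevel(i) % 2) == 1;
    System.out.println((rtl ? "RTL: " : "LTR: ") + run);
}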
I am trying to do the above. One option is to get a set of chars which are special characters and then accomplish this with some Java logic. But then I have to make sure I include all the special chars.
Is there any better way of doing this ?
You need to decide what constitutes a special character. One method that may be of interest is Character.getType(char) which returns an int which will match one of the constant values of Character such as Character.LOWERCASE_LETTER or Character.CURRENCY_SYMBOL. This lets you determine the general category of a character, and then you need to decide which categories count as 'special' characters and which you will accept as part of text.
Note that Java uses UTF-16 to encode its char and String values, and consequently you may need to deal with supplementary characters (see the link in the description of the getType method). This is a nuisance, but the Character class does offer methods which help you detect this situation and work around it. See the Character.isSupplementaryCodePoint(int) and Character.codePointAt(char[], int) methods.
Also be aware that Java 6 is far less knowledgeable about Unicode than is Java 7. The newest version of Java has added far more to its Unicode database, but code running on Java 6 will not recognise some (actually quite a few) exotic codepoints as being part of a Unicode block or general category, so you need to bear this in mind when writing your code.
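As a sketch of how those pieces fit together (the definition of 'special' here is an assumption; adjust the accepted categories to your needs):

// Iterate by code point, not by char, so that supplementary characters
// (surrogate pairs in UTF-16) are handled correctly.
static void printSpecialChars(String s) {
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        int type = Character.getType(cp);
        boolean ordinary = Character.isLetterOrDigit(cp)
                || type == Character.SPACE_SEPARATOR;
        if (!ordinary) {
            System.out.printf("special: U+%04X%n", cp);
        }
        i += Character.charCount(cp);
    }
}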
It sounds like you would like to remove all control characters from a Unicode string. You can accomplish this by using a Unicode character category identifier in a regex. The category "Cc" contains those characters, see http://www.fileformat.info/info/unicode/category/Cc/list.htm.
myString = myString.replaceAll("[\\p{Cc}]+", "");
I couldn't find any documentation about this...
I want to write a bunch of chars to a file and make sure that the file's size in bytes equals the number of chars.
Does anyone know what class to use?
I want to write a bunch of chars to a file and make sure that the file's size in bytes equals the number of chars.
Okay - so you need to pick an encoding which only uses a single byte per character, such as ISO-8859-1. Create a FileOutputStream, wrap it in an OutputStreamWriter specifying the encoding, and you're away. However, you need to be aware that you're limiting the range of characters which can be represented in your file.
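A minimal sketch of that setup (the filename is just an example):

import java.io.*;

// ISO-8859-1 maps each char in U+0000..U+00FF to exactly one byte,
// so the file size in bytes equals the number of chars written.
Writer writer = new OutputStreamWriter(new FileOutputStream("out.txt"), "ISO-8859-1");
writer.write("some Latin-1 text");
writer.close();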
Take a "Writer"
Writer do output chars
http://docs.oracle.com/javase/1.4.2/docs/api/java/io/FileWriter.html
http://docs.oracle.com/javase/1.4.2/docs/api/java/io/OutputStreamWriter.html
OutputStream do output bytes
You may try to use another encoding. In that case you should supply a CharsetEncoder, as it has an onUnmappableCharacter method:
http://docs.oracle.com/javase/1.4.2/docs/api/java/nio/charset/CharsetEncoder.html#onUnmappableCharacter%28java.nio.charset.CodingErrorAction%29
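A minimal sketch of that approach, assuming you want a hard failure instead of a silent '?' substitution (the filename is just an example):

import java.io.*;
import java.nio.charset.*;

// REPORT makes the writer throw an exception instead of silently
// replacing characters that cannot be encoded in ISO-8859-1.
CharsetEncoder encoder = Charset.forName("ISO-8859-1").newEncoder()
        .onUnmappableCharacter(CodingErrorAction.REPORT);
Writer writer = new OutputStreamWriter(new FileOutputStream("out.txt"), encoder);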
First figure out which kinds of chars you are going to be talking about.
In C, a char is eight bits, even if you need two or more chars in sequence to represent one glyph or, in human terms, one typed character. It gets worse: there are also glyphs that represent two "typed" characters, like the conjoined ff and fl ligatures you often see in typesetting.
If you are talking about C chars, then by definition every file contains the same number of chars as bytes. If you are talking about any other meaning of the word "character", then you need to make some choices.
Eight-bit characters are guaranteed for the ASCII character set in UTF-8, which is by far the best character set to choose going forward, as it has explicit support in web protocols (thank you, W3C!). This means that as long as you verify that every Java char in your string is less than 128 (integer value), you are going to get one byte per char with UTF-8; a small verification sketch follows below.
ISO-8859-1 is a character set which also uses only one byte per character. The downside to ISO-8859-1 is that it tends to not be the default character set of anything other than Microsoft systems. Even within the Microsoft realm, UTF-8 has been making a lot of headway.
The cost to convert between the two is not overly high, but the extensibility of the two differ dramatically. Basically, if you are using ISO-8859-1 and someone tells you that the next product must support language "X", then in some cases, you must first convert to a different character set and then add the language support. With UTF-8 such a need to convert to another character set prior to adding support is rare. I mean very rare, like so rare that you should consider just using images because the language is likely dead, is likely of historical interest only, and is likely to have been documented as a dialect from a lesser tribe on an island where the primary language has full support.
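For the verification step mentioned above, a tiny sketch (the method name is hypothetical):

// True if every char is in the ASCII range, i.e. the string will
// occupy exactly one byte per char when encoded as UTF-8.
static boolean isAscii(String s) {
    for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) >= 128) {
            return false;
        }
    }
    return true;
}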
I am having a problem when dealing with non-English filenames.
The problem is that my program cannot guarantee that those directories and filenames are in English; if some filenames use Japanese or Chinese characters, it will display some character like '?'.
Can anybody suggest what I need to do to access non-English filenames?
The problem is that my program cannot guarantee those directories and filenames are in English. If a filename uses Japanese or Chinese characters, it will display some character like '?'.
The problem is apparently that "it" is using the wrong character set to display the filenames. The solution depends on whether "it" is your program (via a GUI), some other application, the command shell / terminal emulator, or the user's web browser. If you could provide more information, maybe I could offer some suggestions.
But turning the characters into underscores is most likely a bad solution. It is liable to lead to filename clashes, and those Chinese / Japanese / etc characters are most likely meaningful to the people who created the files.
By the way, the correct term for "english" letters is Latin.
EDIT
For your use-case, you don't need to store the PDF file under a filename that bears any relation to the supplied filename. I suggest that you try to solve the problem by using a filename consisting of Latin digits and letters generated from (say) System.currentTimeMillis(). If that fails, then your real problem has nothing to do with the filenames at all.
EDIT 2
You ask about the statement
if (fileName.startsWith("=?iso-8859"))
This seems to be trying to unpick a filename in MIME encoded-word format; see RFC 2047 Section 2
Firstly, I think that code may be unnecessary. The javadoc is not specific, but I think the Part.getFileName() method should deal with decoding the filename.
Second, if the decoding is necessary, then you are going about it the wrong way. The stuff after the charset cannot simply be treated as the value of the filename. Look at the RFC.
Third, if you do need to decode it yourself, use the relevant MimeUtility methods to decode "word" tokens ... like the filename (see the sketch below).
Fourthly, ISO-8859-1 is NOT a suitable encoding for characters in non-Latin character sets.
Finally, examine the raw email headers of the emails that you are trying to decode and look for the header line that starts
Content-Disposition: attachment; filename=...
If the filename looks like "=?iso-8859-1?...", and the filename is supposed to contain japanese / chinese / etc characters, then the problem is in the client (or whatever) that constructed the email. The character set needs to be "utf-8" or one of the other multibyte character sets.
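For the third point, a minimal sketch assuming JavaMail is on the classpath (rawFileName is a placeholder for whatever Part.getFileName() returned):

import javax.mail.internet.MimeUtility;
import java.io.UnsupportedEncodingException;

// Decodes RFC 2047 encoded-words such as "=?iso-8859-1?Q?...?=".
// Plain (unencoded) filenames are returned unchanged.
String decoded;
try {
    decoded = MimeUtility.decodeText(rawFileName);
} catch (UnsupportedEncodingException e) {
    decoded = rawFileName; // fall back to the raw value
}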
Java uses Unicode natively - you don't need to replace special characters, as Unicode has no special characters - every code point is treated equally. Your replaceSpChars() may be the culprit here.