This seems to be a fairly common issue, but I haven't been able to find a solution yet, perhaps because it comes in so many flavors. Here it is, though. I'm trying to read some comma-delimited files (occasionally the delimiters can be a little more exotic than commas, but commas will suffice for now).
The files are supposed to be standardized across the industry, but lately we've been seeing files arrive in many different character sets. I'd like to be able to set up a BufferedReader to compensate for this.
What is a pretty standard way of doing this and detecting whether it was successful or not?
My first thought is to loop through character sets from simple to complex until I can read the file without an exception. Not exactly ideal, though...
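Something like this rough sketch is what I have in mind (the candidate list and file name are placeholders; note that a plain InputStreamReader quietly replaces bad bytes, so the decoder has to be told to REPORT errors before an exception is even possible):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.*;
import java.nio.file.*;
import java.util.Arrays;
import java.util.List;

public class CharsetGuesser {
    // Try candidate charsets from "simple" to "complex" and return the first
    // one that can decode the whole file without errors.
    public static Charset guess(Path file) throws IOException {
        List<Charset> candidates = Arrays.asList(      // placeholder list
                StandardCharsets.US_ASCII,
                StandardCharsets.UTF_8,
                StandardCharsets.ISO_8859_1);
        byte[] bytes = Files.readAllBytes(file);
        for (Charset cs : candidates) {
            CharsetDecoder decoder = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(bytes));
                return cs;                             // decoded cleanly
            } catch (CharacterCodingException e) {
                // not this charset, try the next candidate
            }
        }
        return null;                                   // nothing fit
    }
}
```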
Thanks for your attention.
Mozilla's universalchardet is supposed to be the most efficient detector out there. juniversalchardet is the Java port of it. There is one more port. Read this SO question for more information: Character Encoding Detection Algorithm
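If you go the juniversalchardet route, the usage is roughly this (a sketch based on the library's documented API; the buffer size and file name are placeholders):

```java
import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class CharsetSniffer {
    // Feed the raw bytes of the file to the detector and ask it for its best guess.
    public static String detectCharset(String fileName) throws IOException {
        byte[] buf = new byte[4096];
        try (FileInputStream fis = new FileInputStream(fileName)) {
            UniversalDetector detector = new UniversalDetector(null);
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
            detector.dataEnd();
            return detector.getDetectedCharset();  // null if it could not decide
        }
    }
}
```

You can then hand the detected name to new InputStreamReader(in, charsetName) and fall back to a sensible default when the result is null.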
I've been asked to write code that takes a string of text and checks whether its encoding is equal to a specific encoding that we want. I've searched a lot but didn't find anything. I found a method (getEncoding()), but it only works with files, and that is not what I want. I've also been told that I should use the Java library, not methods from Mozilla or Apache.
I really appreciate any help. Thanks in advance.
What you are thinking of is "internationalization". There are libraries for this, like Loc4j, but you can also get it using java.util.Locale in Java. However, in general, text is just text: it is a token with a certain value. No localization information is stored in the characters themselves. This is why a file normally provides its encoding in a header. A console or terminal can also provide localization using certain commands/functions.
Unless you know the source encoding and the tokens used, you have only a limited ability to guess what encoding was used at the other end. If you still want to do this, you will need to go into deeper areas such as the statistical analysis used in decryption. That in turn requires databases of how often different tokens are used, and depending on the quality of the text, the databases and the algorithms, a certain amount of text is required. Special cases, like Swedish written with e.g. a US encoding (using a for å and ä, or o for ö), require even more advanced analysis.
EDIT
Since I got a comment that encoding and internationalization are different things, I will add some remarks. It is possible to work with different encodings while dealing purely with English (including a few special English characters). It is also possible to work with encodings directly, for example using Charset. However, for many applications that juggle different encodings it may still be convenient to use Locale, since that library can do a lot of operations on text in different encodings.
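As a small illustration of what I mean by working with encodings through Charset in the standard library (the sample string is made up):

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        String text = "höliday";                                   // sample text with one non-ASCII char
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length);                           // 8 bytes: 'ö' needs two bytes in UTF-8
        System.out.println(latin1.length);                         // 7 bytes: one byte in Latin-1
        // Decoding with the wrong charset mangles the text, which is exactly
        // why the encoding has to be known or declared somewhere.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // prints 'h' plus two mojibake characters instead of 'ö'
    }
}
```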
Thanks for your answers and contributions, but these two links did the trick. I had already seen these two pages, but at first they didn't seem to work for me because I was thinking about getting the encoding directly and then comparing it with the specific one.
This is one of them
This is another one.
I am stuck on a project at work that I do not think is really possible and I am wondering if someone can confirm my belief that it isn't possible or at least give me new options to look at.
We are doing a project for a client that involved a mass download of files from a server (easily done with ftp4j and a document name list), but now we need to sort through the data from the server. The client works with contracts and wants us to pull out relevant information such as: licensor, licensee, product, agreement date, termination date, royalties, restrictions.
Since the documents are completely unstandardized, is that even possible to do? I can imagine loading in the files and searching them, but I would have no idea how to pull information such as the licensor and the restrictions out of a paragraph. These are not hashes (key/value pairs), just long contracts. Even if I were to search for 'Licensor', it will come up in the document multiple times. The documents aren't even in a consistent file format. Some are PDF, some are text, some are HTML, and I've even seen some that were as bad as a scanned image inside a PDF.
My boss keeps pushing for me to work on this project but I feel as if I am out of options. I primarily do web and mobile so big data is really not my strong area. Does this sound possible to do in a reasonable amount of time? (We're talking about at the very minimum 1000 documents). I have been working on this in Java.
I'll do my best to give you some information, as this is not my area of expertise. I would strongly consider writing a script that identifies the type of file you are dealing with, and then calls the appropriate parsing methods to handle what you are looking for.
Since you are dealing with big data, Python could be pretty useful. JavaScript would be my next choice.
If your overall code is written in Java, it should be very portable and flexible no matter which one you choose. Using a regex or a specific string search would be a good way to approach this:
If you are only concerned with 'Licensor' followed by a name, you could identify the format of that particular instance and search for something similar using a regex you create. This can be extrapolated to other kinds of searches.
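For example, something along these lines (the contract snippet and the pattern are purely hypothetical; you would have to tune the regex to the wording that actually appears in your documents):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContractSearch {
    public static void main(String[] args) {
        // Made-up snippet standing in for text extracted from a contract.
        String text = "This Agreement is made between Licensor: Acme Corp. and Licensee: Foo Ltd.";
        // Capture whatever name-like run of characters follows "Licensor".
        Pattern p = Pattern.compile("Licensor:?\\s+([A-Z][\\w .,&-]+?)(?=\\s+and\\b|\\.|,|$)");
        Matcher m = p.matcher(text);
        while (m.find()) {
            System.out.println("Possible licensor: " + m.group(1));
        }
    }
}
```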
For getting text from an image, try using the APIs on this page:
How to read images using Java API?
Scanned Image to Readable Text
For text from a PDF:
https://www.idrsolutions.com/how-to-search-a-pdf-file-for-text/
Also, once the text has been extracted, a PDF is essentially just text, so you should be able to search through it using a regex. That would be my method of attack, or possibly using String.split() and appending to a string buffer.
For text from an HTML doc:
Here is a cool HTML parser library: http://jericho.htmlparser.net/docs/index.html
A resource that teaches how to remove HTML tags and get the good stuff: http://www.rgagnon.com/javadetails/java-0424.html
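A rough sketch of using Jericho for that (the HTML snippet is made up):

```java
import net.htmlparser.jericho.Source;

public class HtmlTextExtract {
    public static void main(String[] args) {
        // Strip the tags from an HTML string and keep only the readable text.
        String html = "<html><body><h1>License Agreement</h1>"
                + "<p>Licensor: Acme Corp.</p></body></html>";
        Source source = new Source(html);
        String plainText = source.getTextExtractor().toString();
        System.out.println(plainText);   // prints the text without the markup
    }
}
```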
If you need anything else, let me know. I'll do my best to find it!
Apache Tika can extract plain text from almost any commonly used file format.
But with the situation you describe, you would still need to analyze the text, as in "natural language recognition". That's a field where, despite some advances made by dedicated research teams spending many person-years, computers still fail pretty badly (heck, even humans fail at it sometimes).
With the number of documents you mention (1000s), hire a temp worker and have them sorted/tagged by human brain power. It will be cheaper and you will have fewer misclassifications.
You can use Tika for text extraction. If there is a fixed pattern, you can extract information using regex or XPath queries. Another solution is to use Solr, as shown in this video. You don't need Solr, but watch the video to get the idea.
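A minimal sketch of the Tika part, using the Tika facade class (the file name is a placeholder; the same call handles PDF, HTML, Word and plain text):

```java
import java.io.File;
import org.apache.tika.Tika;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // parseToString() detects the format and returns the plain-text content.
        String text = tika.parseToString(new File("contracts/agreement-0001.pdf"));
        System.out.println(text);
    }
}
```

Once everything is plain text, the regex approach from the other answer can be applied uniformly, whatever the original format was.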
I have a 40KB HTML page and I want to find certain patterns in it.
I can read it in 1K buffers, but I want to avoid the situation where a pattern I'm searching for gets split between two buffer reads.
How can I overcome this problem?
This is easy. You take the length of the longest pattern you will look for, then either backtrack the file pointer by that amount, or scroll through the file reading only the delta.
Imagine the longest pattern being 26 bytes.
1. Read 1K.
2. Check for all patterns -> nothing.
3. Drop 1K - 26 bytes from the buffer.
4. Read 1K - 26 bytes from the stream and add them to your buffer.
5. Go to 2.
Edit: Let me clarify. There are two methods to do this, and both have their merits. The one I documented above is best used if you are reading from a stream, i.e. a data source that does not support seeking. If, however, your data source does support seeking (like a filesystem file), you can easily do the same with seeks: check for the pattern, and if it is not found, seek back by the size of your longest pattern and continue from there.
If, however, you want to support searching for patterns that are longer than your buffer, you will need a much cleverer algorithm. You would need a lookup table of all patterns that are currently "open" as you continue to read more data, which in turn costs more memory - you get the problem.
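A rough Java sketch of the stream variant (the file name and pattern are placeholders; it keeps the last patternLength - 1 characters as overlap between reads so a match can never be cut in half):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class OverlappingSearch {
    public static boolean contains(String fileName, String needle) throws IOException {
        char[] buf = new char[1024];                       // 1K reads, as in the question
        int overlap = needle.length() - 1;                 // longest piece a boundary could cut off
        StringBuilder window = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            int read;
            while ((read = in.read(buf)) != -1) {
                window.append(buf, 0, read);
                if (window.indexOf(needle) != -1) {
                    return true;                           // found, possibly across a read boundary
                }
                // Keep only the tail that could still be the start of a match.
                if (window.length() > overlap) {
                    window.delete(0, window.length() - overlap);
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(contains("page.html", "<title>"));
    }
}
```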
That's what the Scanner class is for.
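For example, Scanner buffers the input internally, so Scanner.findWithinHorizon can match a pattern regardless of where it falls relative to your 1K chunks (a sketch; file name and pattern are placeholders):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Pattern;

public class ScannerSearch {
    public static void main(String[] args) throws FileNotFoundException {
        try (Scanner scanner = new Scanner(new File("page.html"))) {
            // A horizon of 0 means "search the whole input".
            String hit = scanner.findWithinHorizon(Pattern.compile("<title>.*?</title>"), 0);
            System.out.println(hit != null ? hit : "no match");
        }
    }
}
```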
You could take a look at CharBuffer, which implements CharSequence for just this purpose.
Why not use a SAX parser? It is built to handle large mark-up files. You would only run into problems if you are trying to match across different elements at the same level. However, this is not impossible to handle.
I have to write code to upload/download a file to/from a remote machine. But when I upload the file, newlines are not preserved, and some binary characters get inserted automatically. I'm also not able to save the file in its actual format; I have to save it as "filename.ser". I'm using the serialization/deserialization concept of Java.
Thanks in advance.
How exactly are you transmitting the files? If you're using implementations of InputStream and OutputStream, they work on a byte-by-byte level so you should end up with a binary-equal output.
If you're using implementations of Reader and Writer, they convert the bytes to characters according to some character mapping, and then perform the reverse process when saving. Depending on the platform encodings of the various machines (and possibly other effects if you're not specifying the charset explicitly), you could well end up with differences in the binary file.
The fact that you mention newlines makes me think that you're using Readers to send strings (and possibly that you're stitching the strings back together yourself by manually adding newlines). If you want the files to be binary equal, then send them as a stream of bytes and store that stream verbatim. If you want them to be equal as strings in a given character set, then use Readers and Writers but specify the character set explicitly. If you want them to be transmitted as strings in the platform default set (not very useful), then accept that they're not going to be binary equal as files.
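A minimal sketch of the byte-stream approach (the file names are placeholders; the same copy loop works when in and out are socket streams):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class BinaryCopy {
    // Copy a stream verbatim, byte for byte, with no Reader/Writer in between,
    // so newlines and "binary characters" survive unchanged on every platform.
    public static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream("upload.dat");
             OutputStream out = new FileOutputStream("download.dat")) {
            copy(in, out);
        }
    }
}
```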
(Also, your question really doesn't provide much information to solve it. To me, it basically reads "I wrote some code to do X, and it doesn't work. Where did I go wrong?" You seem to assume that your code is correct by not listing it, but at the same time recognise that it's not...)
I have been studying Netty and Mina, but I am confused about the best way to rewrite binary streams. For example, I would like to create a proxy that lets me replace XML in the stream and forward it along.
Examples appreciated.
I think you're thinking at too low a level. XML is not so much "binary" as an abstraction on top of binary. If you want to replace snippets of XML as they come across your line, you'll have to poke into the payload portion of the packets and look for patterns of XML. A simple way is to use a regular expression after temporarily rebuilding the bytes into content.
Once you have that search in place and have matched what you want, you can make your replacement and re-send.
The hard part is that you will likely need to cache some input before it leaves your machine, so that you are able to find both the beginning and the end of what you are searching for. What makes this difficult is that you often don't know what constitutes the "beginning" and the "end" of a data payload.
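As a rough illustration of just the match-and-replace step (the element name, the charset, and the assumption that you have already buffered a complete document are all mine, not part of the question):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class XmlRewriter {
    private static final Pattern PRICE = Pattern.compile("<price>.*?</price>", Pattern.DOTALL);

    // Takes the bytes you have accumulated from the stream, rebuilds them into
    // text, replaces the matched XML snippet, and returns bytes ready to re-send.
    public static byte[] rewrite(byte[] payload) {
        String xml = new String(payload, StandardCharsets.UTF_8);
        String rewritten = PRICE.matcher(xml).replaceAll("<price>0.00</price>");
        return rewritten.getBytes(StandardCharsets.UTF_8);
    }
}
```

In a real Netty or Mina handler, the caching described above would happen before this call, so that rewrite() only ever sees a complete payload.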