Lucene encoding, java - java

I have some questions about character encoding in Lucene (Java).
How does Lucene handle encoding? What is the default, and how can I set it?
Or does Lucene not care about the encoding, so that it is just a matter of how I add a string to a document in the indexing phase (Java code below) and how I then search the index?
In other words, do I have to make sure that the input text and the queries are both in UTF-8?
Document doc = new Document();
doc.add(new TextField(tagName, object.getName(), Field.Store.YES));
Thanks for any help

Lucene stores terms in UTF-8. (See Lucene's BytesRef class)
Java's String is UTF-16 internally. Lucene's BytesRef has a constructor that converts UTF-16 to UTF-8, so a Java String can be used without any issues.
For example, the TextField in your code takes a String as the field value.
If you use some other type of Field that takes a byte[], you need to make sure the bytes are UTF-8.
When querying, Lucene always gives you UTF-8 bytes; you can convert them to a Java String with a method provided by the same class (BytesRef.utf8ToString()). You can always interpret these bytes in other character sets.
You have to take care of character encoding yourself: as long as you get the characters right in the Java String, you should be fine. For example, if the data you are indexing comes from an XML file or a database that uses a different character set, you have to make sure these sources are read properly in the JVM used for indexing.
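As a minimal sketch of "getting the characters right" before indexing, the following decodes raw bytes with the charset the source was written in. It uses only the JDK; the class and method names are made up for illustration, and the Lucene call itself is left out. The resulting String is what you would pass to a TextField.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene.
final class DecodeBeforeIndexing {
    // Decode raw bytes using the charset the source was actually written in.
    static String decode(byte[] raw, Charset sourceCharset) {
        return new String(raw, sourceCharset);
    }

    public static void main(String[] args) {
        // "café" as ISO-8859-1 bytes: 'é' is the single byte 0xE9.
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};

        String right = decode(latin1, StandardCharsets.ISO_8859_1);
        String wrong = decode(latin1, StandardCharsets.UTF_8);

        System.out.println(right.equals("caf\u00E9")); // correct charset: characters survive
        System.out.println(wrong.equals("caf\u00E9")); // wrong charset: 0xE9 is not valid UTF-8 on its own
    }
}
```

Once the String is correct, the encoding of the original source no longer matters to the index.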

Related

Is 'the local character set' the same as 'the encoding of the text data you want to process'?

The Oracle Java Documentation states the following boast in its Tutorial introduction to character streams:
A program that uses character streams in place of byte streams automatically adapts to the local character set and is ready for internationalization — all without extra effort by the programmer.
(http://docs.oracle.com/javase/tutorial/essential/io/charstreams.html)
My question is concerned with the meaning of the word 'automatically' in this context. Elsewhere the documentation warns
Data in text files is automatically converted to Unicode when its encoding matches the default file encoding of the Java Virtual Machine.... If the default file encoding differs from the encoding of the text data you want to process, then you must perform the conversion yourself. You might need to do this when processing text from another country or computing platform.
(http://docs.oracle.com/javase/tutorial/i18n/text/convertintro.html)
Is 'the local character set' in the first quote analogous to 'the encoding of the text data you want to process' of the second quote? And if so, is the second quote not exploding the boast of the first - that you don't need to do any conversion unless you need to do a conversion?
In the context of the first tutorial you have linked, I read it that they use "local character set" to mean the default character set.
For example:
inputStream = new FileReader("xanadu.txt");
They are creating a FileReader, which does not allow you to specify a Charset, so the JVM's default charset will be used:
FileReader(String) calls
InputStreamReader(InputStream), which calls
StreamDecoder.forInputStreamReader(InputStream, Object, String), with null as the last parameter
So Charset.defaultCharset() is used as the Charset
If you wanted to use an explicit charset, you would write:
inputStream = new InputStreamReader(new FileInputStream("xanadu.txt"), charset);
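A self-contained sketch of the explicit-charset variant, assuming a small temporary file written in ISO-8859-1 for the demonstration (the class name, helper names, and file contents are illustrative, not from the tutorial):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitCharsetRead {
    // Reads the first line of a file, decoding with the given charset
    // instead of relying on the JVM default.
    static String firstLine(Path file, Charset charset) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: writes the text to a temporary file in the given charset.
    static Path writeTemp(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            Files.write(file, text.getBytes(charset));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeTemp("na\u00EFve", StandardCharsets.ISO_8859_1);
        // Decoding with the charset the file was written in recovers the text.
        System.out.println(firstLine(file, StandardCharsets.ISO_8859_1).equals("na\u00EFve"));
        // What a charset-less FileReader would have used on this machine:
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```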
No. The local character set is the character set (table of character values and respective codes) that the file uses, but the default text encoding is how the JVM interprets the characters (converts them into their character codes). They are linked and very similar, but not exactly the same.
Also, it says that it "automatically" converts it because that is the function of the JVM: it automatically converts the characters in the text file that contains your code into code that the machine can read.

When is encoding relevant in Java?

This might be a bit of a beginner question, but it is fairly relevant when debugging encoding in Java: at what point is an encoding relevant to a String object?
Say I have a String object that I want to save to a file. Does the String object itself use some sort of encoding that I should manipulate, or do I only specify the encoding when I create a stream of bytes to save?
The same applies to importing: when I open a file and get its bytes, I assume there is no encoding at hand, only bytes. When I parse these bytes into a String, I have to use an encoding to understand which characters they are. After I parse those bytes, does the String (in memory) carry some sort of meta information about the encoding, or is this handled only by the JVM?
This is vital because I'm having file import/export issues and I need to understand at which point I should worry about getting the encoding right.
Hope I explained my doubt well, and thank you in advance!
Java strings do not have explicit encoding information. They don't know where they came from, and they don't know where they are going. All Java strings are stored internally as UTF-16.
You (optionally) specify what encoding to use whenever you want to turn a String into a sequence of bytes (e.g., to save to a file), or when you want to turn a sequence of bytes (e.g., read from a file) into a String.
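A short sketch of that boundary, using only the JDK (class and method names are illustrative): the charset appears only where a String becomes bytes or bytes become a String, and the same String yields different bytes depending on the charset chosen.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StringBytesBoundary {
    // Encoding: the charset is chosen here, when the String becomes bytes.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Decoding: the charset is chosen here, when bytes become a String.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // the euro sign

        byte[] utf8 = encode(euro, StandardCharsets.UTF_8);
        byte[] utf16 = encode(euro, StandardCharsets.UTF_16BE);

        // Same String, different byte sequences.
        System.out.println(utf8.length + " bytes in UTF-8, " + utf16.length + " bytes in UTF-16BE");

        // Decoding with the matching charset gives the same String back.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(euro));
        System.out.println(decode(utf16, StandardCharsets.UTF_16BE).equals(euro));
    }
}
```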
Encoding matters for a String when you are serializing to or deserializing from disk or the web. There are multiple text encodings: ASCII, Latin-1, UTF-8/16 (I believe there may be two UTF-16 formats, but I'm not 100% sure).
See InputStreamReader for how to load a String from text encoded in a non-default format

Encoding issues

I have a "windows1255" encoded String. Is there any safe way I can convert it to a "UTF-8" String and vice versa?
In general, is there a safe way (meaning the data will not be damaged) to convert between encodings in Java?
byte[] bytes = str.getBytes("UTF-8");
new String(bytes, "UTF-8");
If the original string is not encoded as "UTF-8", can the data be damaged?
You can't have a String object in Java properly encoded as anything other than UTF-16, as that is the sole encoding for those objects defined by the spec. Of course you could do something untoward like put 1252 values in a char[] and create a String from it, but things will go wrong pretty much immediately.
What you can have is byte[] encoded in various different ways, and you can convert them to and from String using constructors which take a Charset, and with getBytes as in your code.
So you can do conversions using a String as an intermediate. I don't know of any way in the JDK to do a direct conversion, but the intermediate is likely not too costly in practice.
About round-trip conversions: it is not generally true that you can convert between encodings without losing data. Only a few encodings can handle the full spectrum of Unicode characters (e.g. the UTF family, GB18030, etc.), while many legacy character sets encode only a small subset. You can't safely round-trip through those character sets without losing data, unless you are sure the input falls into the representable set.
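One way to check whether input falls into the representable set before converting is CharsetEncoder.canEncode. The sketch below uses ISO-8859-1 as the legacy charset because it is guaranteed to exist in every JVM; the class name and sample text are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripCheck {
    // True if every character of the text is representable in the charset.
    static boolean isRepresentable(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Encodes to bytes and decodes again with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String hebrew = "\u05E9\u05DC\u05D5\u05DD"; // "shalom" in Hebrew letters

        // UTF-8 covers all of Unicode, so the round trip is lossless.
        System.out.println(isRepresentable(hebrew, StandardCharsets.UTF_8));
        System.out.println(roundTrip(hebrew, StandardCharsets.UTF_8).equals(hebrew));

        // ISO-8859-1 has no Hebrew letters: getBytes substitutes '?' for each one.
        System.out.println(isRepresentable(hebrew, StandardCharsets.ISO_8859_1));
        System.out.println(roundTrip(hebrew, StandardCharsets.ISO_8859_1));
    }
}
```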
String is attempting to be a sequence of abstract characters; it does not have any encoding from the point of view of its users. Of course, it must have an internal encoding, but that's an implementation detail.
It makes no sense to encode String as UTF-8, and then decode the result back as UTF-8. It will be no-op, in that:
(new String(str.getBytes("UTF-8"), "UTF-8") ).equals(str) == true;
But there are cases where the String abstraction falls apart and the above will be a "lossy" conversion. Because of the internal implementation details, a String can contain unpaired UTF-16 surrogates, which cannot be represented in UTF-8 (or any encoding for that matter, including the internal UTF-16 encoding*). So they are lost in the encoding (Java substitutes a '?' for each one), and when you decode back, you do not get the invalid unpaired surrogates of the original string.
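A small sketch of that lossy case, using only the JDK (class and method names are illustrative). A String holding a lone high surrogate does not survive a UTF-8 round trip, because String.getBytes replaces malformed input with the charset's replacement byte.

```java
import java.nio.charset.StandardCharsets;

final class UnpairedSurrogate {
    // Encodes to UTF-8 and decodes back.
    static String utf8RoundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String valid = "a\uD83D\uDE00b"; // a proper surrogate pair between 'a' and 'b'
        String broken = "a\uD800b";      // a lone high surrogate between 'a' and 'b'

        // A well-formed String round-trips unchanged.
        System.out.println(utf8RoundTrip(valid).equals(valid));

        // The lone surrogate cannot be encoded, so the result differs from the input.
        System.out.println(utf8RoundTrip(broken).equals(broken));
    }
}
```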
The only thing I can take from your question is that you have a String result from interpreting binary data as Windows-1255, where it should have been interpreted in UTF-8.
To fix this, you would have to go to the source of this and use UTF-8 decoding explicitly.
If, however, you only have the string that resulted from the misinterpretation, you can't really do anything, as many bytes have no representation in Windows-1255 and would not have made it into the string.
If this wasn't the case, you could fully restore the original intended message by:
new String( str.getBytes("Windows-1255"), "UTF-8");
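The same repair can be sketched in a case where it is guaranteed to work. This example swaps in ISO-8859-1 for Windows-1255, since ISO-8859-1 maps every one of the 256 byte values to a character, so no bytes are lost in the wrong decoding; the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Simulates the mistake: UTF-8 bytes decoded as ISO-8859-1.
    static String misdecode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Undoes the mistake: recover the original bytes, then decode as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9";
        String garbled = misdecode(original.getBytes(StandardCharsets.UTF_8));

        // 'é' is two bytes in UTF-8, so the garbled text is one character longer.
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals(original));
    }
}
```

With a charset that leaves some byte values unmapped, the first step already destroys information and the repair is no longer exact.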
* It is actually wrong of Java to allow unpaired surrogates to exist in its Strings in the first place since it's not valid UTF-16

Why does ICU4J return the byte-order-mark when reading an array of bytes into a String?

I read a file into an array of bytes. Then I use ICU4J to detect the file's encoding (I don't know what the encoding might be, these files can have multiple different encodings) and return a Unicode String.
Like so:
byte[] fileContent = // read file into byte array
CharsetDetector cd = new CharsetDetector();
cd.setText(fileContent);
CharsetMatch cm = cd.detect();
String result = cm.getString();
When my file is encoded using UTF-16LE the first character in "result" is the byte-order-mark. I'm not interested in this and because it is specific to the encoding scheme and not really part of the file content I would expect it to be gone.
Yet ICU4J returns it. Why is this happening and is there some way of getting around this problem? The only solution I see is manually checking if the first character in the returned String is the byte order mark and stripping it manually. Is there some cleaner/better way?
I just consulted the docs ... icu-project.org/apiref/icu4j/com/ibm/icu/text/…. They in fact say that it returns the corresponding Java String but they do not say anything about removing the BOM. So I'd expect it to be there if it was in the first place.
To me it is natural that it is also retrieved. I'd expect them to explicitly mention it in the docs if they were trimming out the BOM.
I think the answer is here unicode.org/faq/utf_bom.html#bom1 - "Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol."
I think that's pretty much it. If a BOM is mandatory, you'd have to add it again. Filtering it out if the BOM is prohibited is considered the easy part I guess :)
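The manual check mentioned in the question is short. A minimal sketch using only the JDK follows (the class and method names are made up, and ICU4J is not involved); it decodes UTF-16LE bytes with Java's own decoder, which also keeps the BOM as the first character.

```java
import java.nio.charset.StandardCharsets;

final class BomStripper {
    private static final char BOM = '\uFEFF';

    // Returns the text without a leading byte order mark, if one is present.
    static String stripBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // UTF-16LE bytes for "hi", preceded by the BOM (FF FE).
        byte[] bytes = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        String decoded = new String(bytes, StandardCharsets.UTF_16LE);

        System.out.println(decoded.length());           // BOM + 2 letters
        System.out.println(stripBom(decoded).equals("hi"));
    }
}
```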

JSoup encoding issue with numeric character references

We're doing the following:
Open a Reader for a file, using some specified encoding.
Read in each line, parsing it as CSV.
For some of the columns in the CSV data, pass it to JSoup to clean out HTML as below:
public String apply(@Nullable String input) {
    Document document = Jsoup.parse(input);
    return document.text();
}
This works great, except in the presence of numeric character references, such as &#160;. What seems to be happening is that since we necessarily must do the JSoup call after we've figured out the encoding (to get the CSV parsing to work), when JSoup gets round to converting hard-coded bytes into characters, we're working with the wrong character set. Byte 160 (0xa0) is non-breaking space in windows-1252, but is not a valid Unicode character so gives us bad data when JSoup is replacing the numeric character reference with a byte.
Is there a way around this? It would require JSoup to be given a 'source encoding' for numeric character references, or something like that.
Try calling the following before text():
document.outputSettings().charset("windows-1252");
For more output settings see the javadoc.
