Using Java for Arabic NLP

I'm working on Arabic natural language processing, such as word stemming, tokenization, etc.
To work with words and characters, I need to write Arabic letters in Java. So my question is: is it good practice to write Arabic letters directly in Java source, rather than as Unicode escape sequences?
Example: which one is better,
if (word.startsWith("ت")) {...}
or
if (word.startsWith("\u062A")) {...}

You should write the Arabic letters themselves for the sake of readability; to the machine there is no real difference. Also set your character encoding to UTF-8, since Arabic characters cannot be represented in ASCII.
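As a quick check (a minimal sketch, assuming the source file is saved and compiled as UTF-8, e.g. with javac -encoding UTF-8), the literal and the escaped form denote exactly the same string, since ت is U+062A:

public class ArabicLiteralCheck {
    public static void main(String[] args) {
        String word = "تجربة"; // an arbitrary Arabic word starting with ت
        System.out.println(word.startsWith("ت"));      // true
        System.out.println(word.startsWith("\u062A")); // true, same character
        System.out.println("ت".equals("\u062A"));      // true, identical strings
    }
}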
If you are familiar with Python, the NLTK module will be of great help to you.

I would go with the real characters in your master copy, ensuring your compiler is configured for the correct encoding. You can always run it through native2ascii if you need the escaped version for any reason. Once you get going you may well find you don't actually have that many hard-coded strings in the source code, as things like gazetteer lists of potential named entities etc. are better represented as external text files.
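For the external-file approach, a minimal sketch (the file name is hypothetical) that loads a gazetteer list with an explicit charset rather than the platform default:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class GazetteerLoader {
    // Reads one entry per line, explicitly as UTF-8, so Arabic entries
    // survive regardless of the platform's default encoding.
    public static List<String> load(String path) throws IOException {
        return Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> entries = load("gazetteer-ar.txt"); // hypothetical file
        System.out.println(entries.size() + " entries loaded");
    }
}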
GATE has a basic named entity annotation plugin for Arabic which may be a good starting point for your work (full disclosure: I'm one of the GATE core development team).

Related

IntelliJ IDEA Non-English Character Problem on Files

I am working on a Spring project. The language used in this project is Turkish, but my system layout and my keyboard are English. When I open the project with IntelliJ IDEA, some characters are not displayed correctly; here is an example:
Kullan\u0131m Detaylar\u0131
This is what I see in IntelliJ IDEA, but instead it should be like this:
Kullanım Detayları
This is just one example; some JSP files can be even worse. One of my JSP files looks like this:
<span>Arýza Ýþlemleri</span>
But it should look like this:
<span>Arıza İşlemleri</span>
This also affects my web output. I tried to change some of the IDE properties, but nothing really changed.
It is a decision, but it is best to do everything in UTF-8: Java sources, JSP sources, properties files and so on.
Ensure that the compilers (javac, jspc) also use UTF-8. This will allow special Unicode characters like bullets, emoji and many more.
You could still deliver a Turkish page in a Turkish encoding like ISO-8859-3, but UTF-8 would do too.
\u0131 (dotless ı) is an ASCII-safe escape. You can use it to test that the encodings are correct (a sketch of such a check follows below).
In IntelliJ, set all the encodings in the preferences and the project settings. If you have a build infrastructure like Maven or Gradle, set the encoding there too, for the compiler.
Possibly check with a programmer's editor like Notepad++ that the file encoding is correct.
And last but not least, check that the database tables/columns are stored as UTF-8.
User form entry should also accept UTF-8, so that the data does not reach the server encoded as numeric entities (which is more efficient).
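A minimal round-trip check along those lines (the file name is hypothetical), assuming the whole chain is UTF-8:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TurkishEncodingCheck {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("encoding-check.txt"); // hypothetical file
        Files.write(file, "Kullanım Detayları".getBytes(StandardCharsets.UTF_8));

        String roundTripped = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        // "\u0131" is the dotless ı; if every step really is UTF-8, this prints true.
        System.out.println(roundTripped.equals("Kullan\u0131m Detaylar\u0131"));
    }
}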
Do not use Turkish characters in the source; use their Unicode escapes instead. For example, for "ı" use "\u0131".

Who is responsible for docs and implementation of the JavaDoc-tool?

This question is related to an issue raised for Maven, which doesn't seem to escape paths forwarded to argument files supported by the JavaDoc-tool on Windows. The problem is that it's unclear from the documentation of JavaDoc itself how paths under Windows should be provided in those files.
The following is for Java 7:
If a filename contains embedded spaces, put the whole filename in double quotes, and double each backslash ("My Files\\Stuff.java").
https://docs.oracle.com/javase/7/docs/technotes/tools/windows/javadoc.html#argumentfiles
The following is from Java 8:
If a file name contains embedded spaces, then put the whole file name in double quotation marks.
https://docs.oracle.com/javase/8/docs/technotes/tools/windows/javadoc.html
In the Java 11 docs that part is completely missing; there is no mention of quotes, spaces or backslashes anymore:
https://docs.oracle.com/en/java/javase/11/javadoc/javadoc-command.html#GUID-EFE927BC-DB00-4876-808C-ED23E1AAEF7D
If you have a look at the URIs, in former versions of Java they were Windows-specific (note the "windows" path segment), while the latest one is not. So I guess things have been refactored and some details about argument files have simply been lost.
So, I need a place where I can talk to people about those differences in the documentation AND, in the end, about how things are supposed to work on Windows: whether backslash is an escape character in paths only, and all that. I would simply like to raise some awareness among people who might know why the docs lack some details now, and maybe even get those details provided again.
So who/where do I write to? I don't know if it's Oracle or the OpenJDK project or someone completely different. Thanks!
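For concreteness, a minimal sketch of what the Java 7 wording quoted above seems to imply for an argument-file entry (the path is hypothetical, and whether current JDKs still expect this is exactly what the question asks):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;

public class ArgFileSketch {
    public static void main(String[] args) throws IOException {
        // A Windows source path containing a space (hypothetical).
        String source = "C:\\My Files\\Stuff.java";

        // Per the Java 7 rule: quote the whole name and double each backslash.
        String entry = "\"" + source.replace("\\", "\\\\") + "\"";

        Files.write(Paths.get("sources.txt"), Collections.singletonList(entry));
        // The tool would then be invoked as: javadoc @sources.txt ...
    }
}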
I think, though don't take this as authoritative, that the javadoc tool is just an optional tool (can anyone show a formal obligation for any JDK to include a javadoc tool implementation?) with a kind of de-facto standard set by the original owners, Sun and then Oracle.
But de-facto is only de-facto. Formally and strictly speaking, no JDK implementer has any obligation to make their javadoc tool behave like all the others do.
I think the best two places are the javadoc-dev mailing list and the bugs database. Starting at some point (Java 9, I guess) they unified the parsing of the @argument files across the tools. I failed to find the relevant code in the Mercurial repo last time I looked.

FusionCharts not supporting Chinese characters

FusionCharts does not seem to support Chinese characters. I am using the Spring framework for this task and I am generating the XML file dynamically using dom4j. When I use Chinese characters in my chart, it shows some other characters; when I use English there is no problem. Kindly help me to solve this problem.
Did you specify the proper encoding when reading or writing the file? The safest way to handle Chinese characters is to always provide an encoding (e.g. UTF-8) while reading or writing the file; see the sketch below. Once you have a valid Java string, you don't need to worry about encoding anymore.
Anyway, can you post the Chinese characters you are dealing with? They might be special ones that need special handling.
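A minimal sketch of the point above (the file name and XML content are placeholders; dom4j's own XMLWriter/OutputFormat also let you set the output encoding explicitly, if you prefer to stay within that API):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ChartXmlIo {
    public static void main(String[] args) throws IOException {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><chart caption=\"销售报告\"/>";

        // Write with an explicit charset instead of the platform default;
        // it must match the encoding declared in the XML prolog.
        try (Writer out = new OutputStreamWriter(new FileOutputStream("chart.xml"),
                StandardCharsets.UTF_8)) {
            out.write(xml);
        }

        // Read it back with the same charset.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("chart.xml"), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}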

Java problems with UTF-8 in different OS

I'm programming an application with other people for college homework, and sometimes we use non-English characters in comments or in strings displayed in the views. The problem is that every one of us is using a different OS and sometimes a different IDE.
Concretely, one is using MacOS, another Windows 7, and another one and I use Ubuntu Linux. Furthermore, all of them use Eclipse and I use gedit. We have no idea whether Eclipse or gedit can be configured to work properly with UTF-8; at least I haven't found anything for mine.
The fact is that what I write with non-English characters appears in the Windows and MacOS virtual machines as strange symbols and vice versa, and sometimes what my non-Linux friends write provokes compilation warnings like this: warning: unmappable character for encoding UTF8.
Do you have any idea how to solve this? It is not really urgent, but it would be a help.
Thank you.
Not sure about gedit, but you can certainly configure Eclipse to use whatever encoding you like for source code. It's part of the project properties (and saved in the .settings directory within the project).
Eclipse works fine with UTF-8. See Michael's answer about configuring it. Maybe for Windows and/or MacOS it is really necessary. Ubuntu uses UTF-8 as the default encoding so I don't think it's necessary to configure Eclipse there.
As for Gedit, this picture shows that it is possible to change the encoding when saving a file in Gedit.
Anyway, you need to make sure that all of you use UTF-8 for your sources. This is the only reasonable way to achieve cross-platform portability of your sources.
You could avoid the issue in Strings by using character escape sequences, and using only ASCII encoding for the files.
For example, an en dash can be expressed as "\u2013".
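For instance, a tiny sketch of the escape-sequence approach (the string is an arbitrary example); the source file itself stays pure ASCII:

public class EscapeDemo {
    public static void main(String[] args) {
        // "\u2013" is the Unicode escape for the en dash, so no non-ASCII
        // character ever appears in the source file.
        String range = "pages 10\u201320";
        System.out.println(range); // prints: pages 10–20
    }
}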
You can quickly search for the Java code for individual characters here.
As Sergey notes below, this works best for small numbers of non-ASCII characters. An alternative is to put all the UTF-8 strings in resource files. Eclipse provides a handy wizard for this.
If your UTF-8 file contains a BOM (byte order mark) then you will have a problem. It is a known bug; see here and here.
The BOM is optional with UTF-8, and most of the time it is not there because it breaks many tools (like Javadoc, XML parsers, ...).
More info here.
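As a workaround, a minimal sketch of a reader that skips a leading U+FEFF (the character a UTF-8 BOM decodes to) before handing the stream to the rest of the code:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class BomSkippingReader {
    public static Reader open(File file) throws IOException {
        PushbackReader reader = new PushbackReader(new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8), 1);
        int first = reader.read();
        if (first != -1 && first != '\uFEFF') {
            reader.unread(first); // not a BOM, push it back
        }
        return reader;
    }
}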

Java File parsing toolkit design, quick file encoding sanity check

(Disclaimer: I looked at a number of posts on here before asking, I found this one particularly helpful, I was just looking for a bit of a sanity check from you folks if possible)
Hi All,
I have an internal Java product that I have built for processing data files for loading into a database (a.k.a. an ETL tool). I have pre-rolled stages for XSLT transformation and for things like pattern replacement within the original file. The input files can be of any format; they may be flat data files or XML data files. You configure the stages required for the particular data feed being loaded.
I have up until now ignored the issue of file encoding (a mistake, I know), because everything was working fine, in the main. However, I am now coming up against file encoding issues. To cut a long story short, because of the way stages can be configured together, I need to detect the encoding of the input file and create a Java Reader object with the appropriate arguments. I just wanted to do a quick sanity check with you folks before I dive into something I can't claim to fully comprehend:
Adopt a standard file encoding of UTF-16 (I'm not ruling out loading double-byte characters in the future) for all files that are output from every stage within my toolkit
Use JUniversalChardet or jchardet to sniff the input file encoding
Use the Apache Commons IO library to create a standard reader and writer for all stages (am I right in thinking this doesn't have a similar encoding-sniffing API?)
Do you see any pitfalls/have any extra wisdom to offer in my outlined approach?
Is there any way I can be confident of backwards compatibility with any data loaded using my existing approach of letting the Java runtime fall back to its default encoding (windows-1252)?
Thanks in advance,
-James
With flat character data files, any encoding detection will need to rely on statistics and heuristics (like the presence of a BOM, or character/pattern frequency) because there are byte sequences that will be legal in more than one encoding, but map to different characters.
XML encoding detection should be more straightforward, but it is certainly possible to create ambiguously encoded XML (e.g. by leaving out the encoding in the header).
It may make more sense to use encoding detection APIs to indicate the probability of error to the user rather than rely on them as decision makers.
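A hedged sketch of how such a detector might be wired in, assuming the juniversalchardet UniversalDetector API, and treating the result as a hint rather than a decision:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.mozilla.universalchardet.UniversalDetector;

public class EncodingSniffer {
    // Returns the detected charset name, or null when the detector is not
    // confident -- in which case the caller should ask the user or fall back.
    public static String sniff(File file) throws IOException {
        byte[] buffer = new byte[4096];
        UniversalDetector detector = new UniversalDetector(null);
        try (InputStream in = new FileInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) > 0 && !detector.isDone()) {
                detector.handleData(buffer, 0, read);
            }
        }
        detector.dataEnd();
        return detector.getDetectedCharset(); // may be null
    }
}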
When you transform data from bytes to chars in Java, you are transcoding from encoding X to UTF-16(BE). What gets sent to your database depends on your database, its JDBC driver and how you've configured the column. That probably involves transcoding from UTF-16 to something else. Assuming you're not altering the database, existing character data should be safe; you might run into issues if you intend parsing BLOBs. If you've already parsed files written in disparate encodings, but treated them as another encoding, the corruption has already taken place - there are no silver bullets to fix that. If you need to alter the character set of a database from "ANSI" to Unicode, that might get painful.
Adoption of Unicode wherever possible is a good idea. It may not be possible, but prefer file formats where you can make encoding unambiguous - things like XML (which makes it easy) or JSON (which mandates UTF-8).
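A trivial illustration of the byte-to-char transcoding point (the charsets here are just examples):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeSketch {
    public static void main(String[] args) {
        // Bytes as they might arrive from a legacy windows-1252 file.
        byte[] legacyBytes = "café".getBytes(Charset.forName("windows-1252"));

        // Decoding turns bytes in encoding X into Java's internal UTF-16 chars...
        String text = new String(legacyBytes, Charset.forName("windows-1252"));

        // ...and encoding turns them back into bytes in whatever target you choose.
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8Bytes.length); // 5: 'é' takes two bytes in UTF-8
    }
}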
Option 1 strikes me as breaking backwards compatibility (certainly in the long run), although it is the "right way" to go (the right way generally does break backwards compatibility); perhaps give some additional thought to whether UTF-8 would be a better choice.
Sniffing the encoding strikes me as reasonable if you have a limited, known set of encodings that you have tested, so that you know your sniffer correctly distinguishes and identifies them.
Another option is to use some form of metadata (a file naming convention, if nothing more robust is available) that lets your code know the data was provided as UTF-16 and behave accordingly, and otherwise convert it to UTF-16 before moving forward.
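Tying this back to the Reader creation mentioned in the question, a hedged sketch (method and names hypothetical) that uses a sniffed or metadata-provided charset when available and otherwise falls back to the existing windows-1252 behaviour:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class ReaderFactory {
    // 'detectedEncoding' stands in for whatever the sniffer or the metadata
    // convention reports; null means "no reliable information".
    public static Reader open(File file, String detectedEncoding) throws IOException {
        Charset charset = detectedEncoding != null
                ? Charset.forName(detectedEncoding)
                : Charset.forName("windows-1252"); // current de-facto default
        return new BufferedReader(new InputStreamReader(new FileInputStream(file), charset));
    }
}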
