Convert .txt file (ANSI encoding) to .arff without losing accents - java

I'm having serious trouble finding out how to convert a .txt file in ANSI encoding to an .arff file in Weka without losing accents, and with them the meaning of some words, in the process. I'm reading articles in Spanish, and the problem is that words with accents are converted badly: the accented letter comes out like this.
My original .txt | .arff file result of the conversion
Minería | Miner�a
The letter "í" is lost in the process.
My current code is the following (example code provided by Weka):
public Instances createDataset(String directoryPath) throws Exception {
    FastVector atts = new FastVector(2);
    atts.addElement(new Attribute("filename", (FastVector) null));
    atts.addElement(new Attribute("contents", (FastVector) null));
    Instances data = new Instances("text_files_in_" + directoryPath, atts, 0);

    File dir = new File(directoryPath);
    String[] files = dir.list();
    for (int i = 0; i < files.length; i++) {
        if (files[i].endsWith(".txt")) {
            try {
                double[] newInst = new double[2];
                newInst[0] = (double) data.attribute(0).addStringValue(files[i]);
                File txt = new File(directoryPath + File.separator + files[i]);
                // I insert my new code from here...
                // ...to here
                InputStreamReader is;
                // note: no charset is given, so the platform default encoding is used
                is = new InputStreamReader(new FileInputStream(txt));
                StringBuffer txtStr = new StringBuffer();
                int c;
                while ((c = is.read()) != -1) {
                    txtStr.append((char) c);
                    // from here on, my own debugging additions
                    // System.out.println("Output " + is.toString());
                }
                newInst[1] = (double) data.attribute(1).addStringValue(txtStr.toString());
                data.add(new Instance(1.0, newInst));
            } catch (Exception e) {
                // System.err.println("failed to convert file: " + directoryPath + File.separator + files[i]);
            }
        }
    }
    return data;
}
I'm using NetBeans to convert files from a folder on my computer.
You may think I'm asking the same thing as other posts on this site, but I'm really not: what I need is a conversion that correctly preserves the Spanish accents.
I've tried changing the encoding in NetBeans to UTF-8 and to ANSI, but neither worked for me (I went to the NetBeans 8.1 configuration file, etc --> netbeans.conf, and added -J-Dfile.encoding=UTF-8 to the netbeans_default_options line, but it still doesn't work). I'm getting a bit frustrated with this problem.
Well, I found a partial solution after losing my mind over this. In fact this isn't a real solution, so I hope that one day someone posts an answer that saves the world of data mining. The workaround consists of saving the text as UTF-8 without BOM (UTF-8 sin BOM). You also have to configure NetBeans to read UTF-8, as I explained above.
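For what it's worth, the root cause in the code above is that the InputStreamReader is constructed without an explicit charset, so Java falls back to the platform default. A minimal sketch of the read loop with the charset named explicitly (assuming the source files are Windows-1252, which is what Windows usually means by "ANSI"; use "UTF-8" if you re-saved the files as UTF-8):

// Sketch: name the charset instead of relying on the platform default.
// "windows-1252" is an assumption matching Windows "ANSI" for Spanish text.
InputStreamReader is = new InputStreamReader(
        new FileInputStream(txt), "windows-1252");
StringBuffer txtStr = new StringBuffer();
int c;
while ((c = is.read()) != -1) {
    txtStr.append((char) c);
}
is.close();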

I had this problem; my solution was to re-encode the file to ANSI.
I used Notepad++.
Steps:
Open your file
Go to the top panel
Encoding -> Encode in ANSI
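If you would rather do the re-encoding in code than in Notepad++, here is a small sketch of the same idea in Java (the file names and the two charsets are placeholders; adjust them to your actual source and target encodings):

import java.io.*;

public class Transcode {
    public static void main(String[] args) throws IOException {
        // Placeholder names: read the file as UTF-8 and write it back as
        // windows-1252 ("ANSI"); swap the charsets for the opposite direction.
        Reader in = new InputStreamReader(new FileInputStream("input.txt"), "UTF-8");
        Writer out = new OutputStreamWriter(new FileOutputStream("output.txt"), "windows-1252");
        int c;
        while ((c = in.read()) != -1) {
            out.write(c);
        }
        in.close();
        out.close();
    }
}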

Related

speech recognition with cmu sphinx - doesn't work properly

I'm trying to use CMU Sphinx for speech recognition in Java, but the result I'm getting is not correct and I don't know why.
I have a .wav file I recorded with my voice saying some sentence in English.
Here is my code in Java:
Configuration configuration = new Configuration();

// Set path to acoustic model.
configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
// Set path to dictionary.
configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
// Set language model.
configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.dmp");

try {
    StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
    recognizer.startRecognition(new FileInputStream("assets/voice/some_wav_file.wav"));
    SpeechResult result = null;
    while ((result = recognizer.getResult()) != null) {
        System.out.println("~~ RESULTS: " + result.getHypothesis());
    }
    recognizer.stopRecognition();
} catch (Exception e) {
    System.out.println("ERROR: " + e.getMessage());
}
I also have other code, for Android, that doesn't work either:
Assets assets = new Assets(context);
File assetDir = assets.syncAssets();
String prefix = assetDir.getPath();

Config c = Decoder.defaultConfig();
c.setString("-hmm", prefix + "/en-us-ptm");
c.setString("-lm", prefix + "/en-us.lm");
c.setString("-dict", prefix + "/cmudict-en-us.dict");
Decoder d = new Decoder(c);

InputStream stream = context.getResources().openRawResource(R.raw.some_wav_file);
d.startUtt();
byte[] b = new byte[4096];
try {
    int nbytes;
    while ((nbytes = stream.read(b)) >= 0) {
        // note: ByteBuffer defaults to big-endian; wav PCM samples are
        // little-endian, so bb.order(ByteOrder.LITTLE_ENDIAN) may be needed here
        ByteBuffer bb = ByteBuffer.wrap(b, 0, nbytes);
        short[] s = new short[nbytes / 2];
        bb.asShortBuffer().get(s);
        d.processRaw(s, nbytes / 2, false, false);
    }
} catch (IOException e) {
    Log.d("ERROR: ", "Error when reading file" + e.getMessage());
}
d.endUtt();
Log.d("TOTAL RESULT: ", d.hyp().getHypstr());
for (Segment seg : d.seg()) {
    Log.d("RESULT: ", seg.getWord());
}
I used this website to convert the wav file to 16-bit, 16 kHz, mono, little-endian (I tried all of its options).
Any ideas why it doesn't work? I'm using the built-in dictionaries and acoustic models, and my English accent is not perfect (I don't know if that matters).
EDIT:
This is my file. I recorded myself saying: "My baby is cute", and that's what I expect the output to be.
In the pure Java code I get: "i've amy's youth", and in the Android code I get: " it".
Here is a file containing the logs.
Your audio is somewhat corrupted by the conversion. You should record into wav originally, or into some other lossless format. Your pronunciation is also far from US English. For conversion between formats you can use sox instead of an external website. Your Android sample seems correct, but it feels like you are decoding a different file on Android. You might check that you have the actual proper file in your resources.
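As a quick sanity check (a sketch, not part of the original answer), you can verify that the converted file really is 16 kHz, 16-bit, mono before feeding it to the recognizer, using the standard javax.sound.sampled API:

import java.io.File;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;

public class WavCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder path: point this at the wav you pass to Sphinx.
        AudioFormat fmt = AudioSystem.getAudioFileFormat(
                new File("assets/voice/some_wav_file.wav")).getFormat();
        // The default CMU Sphinx en-us model expects 16000 Hz, 16-bit, mono.
        System.out.println("sample rate: " + fmt.getSampleRate());
        System.out.println("bits:        " + fmt.getSampleSizeInBits());
        System.out.println("channels:    " + fmt.getChannels());
        System.out.println("big-endian:  " + fmt.isBigEndian());
    }
}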

ItextSharp - diacritic chars

I'm reading PDF documents via the iTextSharp library.
But these documents are in Czech, which uses diacritics (ř ě ž š č etc.).
How can I read these characters? Any idea? Or is there some solution for replacing these characters with plain r e z s c?
This is the code in my method. Thanks
PdfReader reader = new PdfReader("M:/ShareDirs_KSP/RDM_Debtors/DMS_PROD/" + src);
// we can inspect the syntax of the imported page
String text = new String();
for (int page = 1; page <= 1; page++) {
    text += PdfTextExtractor.getTextFromPage(reader, page);
}
reader.close();
I have written a small proof of concept that parses the file czech.pdf. This file contains several characters with diacritics. It was created in answer to the following question: Can't get Czech characters while generating a PDF
The text is stored in the file twice: once using a simple font, once using a composite font. In my proof of concept (named ParseCzech), I parse this PDF to a file encoded using UTF-8 (UNICODE):
public void parse(String filename) throws IOException {
    PdfReader reader = new PdfReader(filename);
    FileOutputStream fos = new FileOutputStream(DEST);
    for (int page = 1; page <= 1; page++) {
        fos.write(PdfTextExtractor.getTextFromPage(reader, page).getBytes("UTF-8"));
    }
    fos.flush();
    fos.close();
}
The result is the file czech.txt:
As you can see from the screen shot, the text is extracted correctly (but make sure that the viewer you use knows that the file is encoded as UTF-8, otherwise you may see strange characters instead of the actual text).
Note that some PDFs do not allow text to be extracted correctly. This is explained in the following video: http://www.youtube.com/watch?v=wxGEEv7ibHE
Please share your PDF so that people on Stack Overflow can check whether you are failing to extract the text because of an error in your code, or because the PDF itself doesn't allow text extraction.

Hebrew encoding from CSV to Eclipse to MySql comes as garbage

I have a CSV file with Hebrew characters. When I open it in TextEdit, on my Mac, I can see the Hebrew just fine.
I bring it into my Java code using a Scanner, decoding it as UTF-8:
File file = new File(System.getProperty("user.dir") + System.getProperty("file.separator") + fileName);
Scanner scanner = new Scanner(new FileInputStream(file), "UTF-8");
Then I parse it and send it to the MySQL database using Hibernate:
for (int i = 0; i < elements.length; i++) {
    String elem = elements[i];
    String[] client = elem.split(",");
    for (int j = 0; j < client.length; j++) {
        Client c = new Client();
        c.setFirstName(client[j]);
        System.out.println(client[j]);
        DatastoreManager.persist(c);
    }
}
Both the printout in the Eclipse console and the entry in MySQL come out as ?????.
Searching for solutions, I tried converting the string to bytes and back:
byte[] ptext = client[j].getBytes("UTF8");
String value = new String(ptext, "UTF-8");
and I converted the MySQL table to the UTF-8 Unicode character set with collation utf8mb4_general_ci.
But nothing seems to work. Any ideas?
Use file -I {filename} on the Mac to check the file's encoding.
Then pass the encoding you get to the Scanner:
Scanner scanner = new Scanner(new FileInputStream(file), "UTF-8");
Now I suppose you will see properly encoded characters in Eclipse.
Since you are using Hibernate and MySQL, you should add the following to your Hibernate configuration:
app_persistance.connection.url=jdbc:mysql://localhost:3306/yourDatabase?useUnicode=true&characterEncoding=utf-8
app_persistance.hibernate.connection.CharSet=utf8
app_persistance.hibernate.connection.characterEncoding=utf8
app_persistance.hibernate.connection.useUnicode=true
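The useUnicode/characterEncoding pair on the JDBC URL is the important part; the same flags apply if you ever open the connection without Hibernate. A sketch with placeholder credentials:

import java.sql.Connection;
import java.sql.DriverManager;

public class Utf8ConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder database and credentials; the URL flags are the point.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/yourDatabase"
                        + "?useUnicode=true&characterEncoding=utf-8",
                "user", "password");
        System.out.println("connected: " + !conn.isClosed());
        conn.close();
    }
}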

Displaying Arabic on Device J2ME

I am using some Arabic text in my app. On the simulator the Arabic text displays fine,
but on the device it is not displayed properly.
On the simulator it is like مَرْحَبًا.
But on the device it is like مرحبا.
What I need is this one: مَرْحَبًا.
This answer shows how to create text resources for a MIDP application and load them at run time. The technique is Unicode safe, and so is suitable for all languages. The run-time code is small, fast, and uses relatively little memory.
Creating the Text Source
The process starts with creating a text file. When the file is loaded, each line becomes a separate String object, so you can create a file like:
اَللّٰهُمَّ اِنِّىْ اَسْئَلُكَ رِزْقًاوَّاسِعًاطَيِّبًامِنْ رِزْقِكَ
مَرْحَبًا
This needs to be in UTF-8 format. On Windows, you can create UTF-8 files in Notepad. Make sure you use Save As..., and select UTF-8 encoding.
Name the file arb.utf8
This needs to be converted to a format that can be read easily by the MIDP application. MIDP does not provide convenient ways to read text files, like J2SE's BufferedReader. Unicode support can also be a problem when converting between bytes and characters. The easiest way to read text is to use DataInput.readUTF(). But to use this, we need to have written the text using DataOutput.writeUTF().
Below is a simple J2SE command-line program that will read the .utf8 file you saved from Notepad and create a .res file to go in the JAR.
import java.io.*;
import java.util.*;

public class TextConverter {

    public static void main(String[] args) {
        if (args.length == 1) {
            String language = args[0];
            List<String> text = new Vector<String>();
            try {
                // read text from Notepad UTF-8 file
                InputStream in = new FileInputStream(language + ".utf8");
                try {
                    BufferedReader bufin = new BufferedReader(new InputStreamReader(in, "UTF-8"));
                    String s;
                    while ((s = bufin.readLine()) != null) {
                        // remove the byte-order mark (U+FEFF) that Notepad adds
                        s = s.replaceAll("\uFEFF", "");
                        text.add(s);
                    }
                } finally {
                    in.close();
                }
                // write it for easy reading in J2ME
                OutputStream out = new FileOutputStream(language + ".res");
                DataOutputStream dout = new DataOutputStream(out);
                try {
                    // first item is the number of strings
                    dout.writeShort(text.size());
                    // then the strings themselves
                    for (String s : text) {
                        dout.writeUTF(s);
                    }
                } finally {
                    dout.close();
                }
            } catch (Exception e) {
                System.err.println("TextConverter: " + e);
            }
        } else {
            System.err.println("syntax: TextConverter <language-code>");
        }
    }
}
To convert arb.utf8 to arb.res, run the converter as:
java TextConverter arb
Using the Text at Runtime
Place the .res file in the JAR.
In the MIDP application, the text can be read with this method:
public String[] loadText(String resName) throws IOException {
    String[] text;
    InputStream in = getClass().getResourceAsStream(resName);
    try {
        DataInputStream din = new DataInputStream(in);
        int size = din.readShort();
        text = new String[size];
        for (int i = 0; i < size; i++) {
            text[i] = din.readUTF();
        }
    } finally {
        in.close();
    }
    return text;
}
Load and use text like this:
String[] text = loadText("arb.res");
System.out.println("my arabic word from arb.res file ::"+text[0]+" second from arb.res file ::"+text[1]);
Hope this will help you. Thanks

Why is Java BufferedReader() not reading Arabic and Chinese characters correctly?

I'm trying to read a file which contains English & Arabic characters on each line, and another file which contains English & Chinese characters on each line. However, the Arabic and Chinese characters fail to show correctly - they just appear as question marks. Any idea how I can solve this problem?
Here is the code I use for reading:
try {
    String sCurrentLine;
    BufferedReader br = new BufferedReader(new FileReader(directionOfTargetFile));
    int counter = 0;
    while ((sCurrentLine = br.readLine()) != null) {
        String lineFixedHolder = converter.fixParsedParagraph(sCurrentLine);
        System.out.println("The line number " + counter
                + " contain : " + sCurrentLine);
        counter++;
    }
} catch (IOException e) {
    e.printStackTrace();
}
Edit 01
After reading a line and getting the Arabic or Chinese word, I use a function to translate it by searching for the given Arabic text in an ArrayList (which contains all expected words), using the indexOf() method. When the word's index is found, it is used to fetch the English word at the same index in another ArrayList. However, this search always fails, because it ends up searching for the question marks instead of the Arabic and Chinese characters. So my System.out.println shows me nulls, one for each failed translation.
*I'm using the NetBeans 6.8 Mac version IDE
Edit 02
Here is the code which searches for the translation:
int testColor = dbColorArb.indexOf(wordToTranslate);
int testBrand = -1;
if (testColor != -1) {
    String result = (String) dbColorEng.get(testColor);
    return result;
} else {
    testBrand = dbBrandArb.indexOf(wordToTranslate);
}
//System.out.println("The testBrand is : " + testBrand);
if (testBrand != -1) {
    String result = (String) dbBrandEng.get(testBrand);
    return result;
} else {
    //System.out.println("The first null");
    return null;
}
I'm actually searching two ArrayLists which might contain the desired word to translate. If the word is not found in either ArrayList, null is returned.
Edit 03
When I debug, I find that the lines being read are stored in my String variable like this:
"3;0000000000;0000001001;1996-06-22;;2010-01-27;����;;01989;������;"
Edit 04
The file I'm reading was given to me after being modified by another program (which I know nothing about, besides that it's made in VB). That program made the Arabic letters that were not appearing correctly appear. When I checked the encoding of the file in Notepad++, it showed ANSI. However, when I convert it to UTF-8 (which replaces the Arabic letters with other, Latin ones) and then convert it back to ANSI, the Arabic becomes question marks!
FileReader javadoc:
Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
So:
Reader reader = new InputStreamReader(new FileInputStream(fileName), "utf-8");
BufferedReader br = new BufferedReader(reader);
If this still doesn't work, then perhaps your console is not set to properly display UTF-8 characters. Configuration depends on the IDE used and is rather simple.
Update: in the above code, replace utf-8 with cp1256. This works fine for me (WinXP, JDK 6).
But I'd recommend that you insist on the file being generated using UTF-8, because cp1256 won't work for Chinese and you'll have similar problems again.
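If the file decodes fine but the console still garbles it, one workaround (a sketch, assuming your terminal itself renders UTF-8) is to print through a PrintStream with an explicit encoding instead of the default-encoded System.out:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Wrap System.out so println encodes output as UTF-8 rather than
        // the platform default (the terminal must also render UTF-8).
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("مرحبا 你好"); // sample Arabic and Chinese text
    }
}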
It is most likely reading the information in correctly; however, your output stream is probably not UTF-8, so any character that cannot be represented in your output character set is being replaced with '?'.
You can confirm this by getting each character out and printing its ordinal.
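A sketch of that check (sCurrentLine is the variable from the question's read loop): a '?' substituted during output prints as ordinal 63, while a correctly decoded Arabic or Chinese character prints as its real Unicode code point.

// Dump each character's ordinal to see whether decoding succeeded.
// 63 means a literal '?'; e.g. 0x645 is the Arabic letter م.
for (char ch : sCurrentLine.toCharArray()) {
    System.out.println((int) ch + " = \\u" + Integer.toHexString(ch));
}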
public void writeTiFile(String fileName, String str) {
    try {
        // write the string using the Arabic Windows code page
        FileOutputStream out = new FileOutputStream(fileName);
        out.write(str.getBytes("windows-1256"));
        out.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
