Read String with RandomAccessFile from file with different encoding - java

I have a big file encoded in Windows-1250. Each line is a single Polish word, one after another:
zając
dzieło
kiepsko
etc
I need to choose 10 unique random lines from this file reasonably quickly. I did this, but when I print the words they come out with the wrong encoding [zaj?c, dzie?o, kiepsko...]; I need UTF-8. So I changed my code to read bytes from the file instead of just reading lines, and my efforts ended up with this code:
public List<String> getRandomWordsFromDictionary(int number) {
    List<String> randomWords = new ArrayList<String>();
    File file = new File("file.txt");
    try {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        for (int i = 0; i < number; i++) {
            Random random = new Random();
            int startPosition;
            String word;
            do {
                startPosition = random.nextInt((int) raf.length());
                raf.seek(startPosition);
                raf.readLine();
                word = grabWordFromDictionary(raf);
            } while (checkProbability(word));
            System.out.println("Word: " + word);
            randomWords.add(word);
        }
    } catch (IOException ioe) {
        logger.error(ioe.getMessage(), ioe);
    }
    return randomWords;
}

private String grabWordFromDictionary(RandomAccessFile raf) throws IOException {
    byte[] wordInBytes = new byte[15];
    int counter = 0;
    byte wordByte;
    char wordChar;
    String convertedWord;
    boolean stop = true;
    do {
        wordByte = raf.readByte();
        wordChar = (char) wordByte;
        if (wordChar == '\n' || wordChar == '\r' || wordChar == -1) {
            stop = false;
        } else {
            wordInBytes[counter] = wordByte;
            counter++;
        }
    } while (stop);
    if (wordInBytes.length > 0) {
        convertedWord = new String(wordInBytes, "UTF8");
        return convertedWord;
    } else {
        return null;
    }
}

private boolean checkProbability(String word) {
    if (word.length() > MAX_LENGTH_LINE) {
        return true;
    } else {
        double randomDouble = new Random().nextDouble();
        double probability = (double) MIN_LENGTH_LINE / word.length();
        return probability <= randomDouble;
    }
}
But something is still wrong. Could you look at this code and help me? Maybe you can see errors that are obvious to you but not to me. I would appreciate any help.

Your file is in Windows-1250, so you need to decode it as Windows-1250, not UTF-8. You can save it as UTF-8 after decoding, though.
Charset w1250 = Charset.forName("Windows-1250");
convertedWord = new String(wordInBytes, w1250);
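If the whole dictionary comfortably fits in memory, a simpler route than seeking by byte offset is to decode the entire file with Windows-1250 and shuffle the lines. This is only a rough sketch (not from the original answer); the file name comes from the question and the method name is made up:
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RandomWords {

    public static List<String> pickRandomWords(String fileName, int number) throws IOException {
        Charset w1250 = Charset.forName("Windows-1250");
        // Decode every line with the charset the file was actually written in.
        List<String> lines = new ArrayList<String>(
                Files.readAllLines(Paths.get(fileName), w1250));
        // Shuffle once; the first N entries are then N unique random lines.
        Collections.shuffle(lines);
        return new ArrayList<String>(lines.subList(0, Math.min(number, lines.size())));
    }
}
Once decoded, the values are ordinary Java strings, so saving them back out as UTF-8 (for example with Files.write and StandardCharsets.UTF_8) is straightforward.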

Related

Java get all class and id names from css file

I am trying to get all the classes and ids from a CSS file into arrays. The arrays should look like this:
UsedIds: {"#id1", "#id2", etc. etc. etc.}
UsedClasses: {".class1", ".class2", etc. etc. etc.}
How do I get these results without also picking up the dotted values inside the curly braces? I tried to remove every "{code inside}" segment, but media queries and similar constructs conflict with that. My first attempt is below, but I am not proud of it... It only removes the code between the curly braces, and I'm stuck with it right now. Do you know an easier solution?
private void getCssClasses(String fileName) {
    File cssFile = new File(fileName);
    Scanner sc;
    try {
        sc = new Scanner(cssFile);
        while (sc.hasNextLine()) {
            String cssLine = sc.nextLine();
            int firstCurly = 0;
            int lastCurly = 0;
            while (cssLine.contains("{")) {
                for (int i = 0; i < cssLine.length(); i++) {
                    String character = "" + cssLine.charAt(i);
                    //System.out.println(character);
                    if (character.contains("{")) {
                        //System.out.println("IN");
                        firstCurly = i;
                    }
                    if (character.contains("}")) {
                        if (firstCurly != 0) {
                            System.out.println("OUT");
                            lastCurly = i;
                        }
                    }
                    if (firstCurly != 0 && lastCurly != 0) {
                        StringBuilder sb = new StringBuilder(cssLine);
                        sb.delete(firstCurly, lastCurly);
                        cssLine = sb.toString();
                        System.out.println("YES");
                        break;
                    }
                }
            }
        }
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
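One possible approach (a sketch, not from the original thread) is to strip comments and declaration blocks first and then pull the selectors out with a regular expression. The file name "style.css" is a placeholder, and the pattern assumes only simple #id and .class tokens are wanted:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssSelectors {

    public static void main(String[] args) throws IOException {
        String css = new String(Files.readAllBytes(Paths.get("style.css")), StandardCharsets.UTF_8);
        // Drop comments and everything between { and }, so property values
        // (hex colors, font shorthands, etc.) no longer look like selectors.
        String selectorsOnly = css.replaceAll("(?s)/\\*.*?\\*/", " ")
                                  .replaceAll("\\{[^{}]*\\}", " ");
        Set<String> ids = new LinkedHashSet<String>();
        Set<String> classes = new LinkedHashSet<String>();
        Matcher m = Pattern.compile("([#.])[A-Za-z_-][\\w-]*").matcher(selectorsOnly);
        while (m.find()) {
            if (m.group().startsWith("#")) ids.add(m.group());
            else classes.add(m.group());
        }
        System.out.println("UsedIds: " + ids);
        System.out.println("UsedClasses: " + classes);
    }
}
For anything beyond a simple stylesheet, a real CSS parser library is a safer choice than regular expressions.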

reading file java.lang.OutOfMemoryError

I am trying to find a word in a large file. The file is read line by line. While reading with readLine, the exception is thrown. Is there any way around this? Can I read it line by line as a String?
for (String line; (line = fileOut.readLine()) != null; ) {
    if (line.contains(commandString))
        System.out.println(count + ": " + line);
    count++;
}
java.lang.OutOfMemoryError:
UPD:
Here is all of my bad code:
static String file = "files.txt";
static String commandString = "first";
static int count = 1;

public static void main(String[] args) throws IOException {
    try (BufferedReader fileOut = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), "Cp1251"))) {
        for (String line; (line = fileOut.readLine()) != null; ) {
            if (line.contains(commandString))
                System.out.println(count + ": " + line);
            count++;
        }
        System.out.println("before wr close :" + Runtime.getRuntime().freeMemory());
        fileOut.close();
    } catch (Exception e) {
        System.out.println(e);
    }
}
Searching for a word, you can read the file bytewise without holding more than a single byte of the file in memory.
Read byte by byte and, every time a byte is equal to the first byte of the searched word, start a second loop that reads the following bytes and checks whether each one is equal to the next byte of the word, and so on.
To give you an example, I have modified a sample to your needs.
I've omitted the output of the file contents, because I don't know whether you want to output all lines or only those that contain your keyword, and the latter might be as problematic as reading the file line by line.
static String fileName = "files.txt";
static byte[] searchString = { 'f', 'i', 'r', 's', 't' };
static int count = 0;
static long position = 1;

public static void main(String[] args) throws IOException {
    try (FileInputStream file = new FileInputStream(fileName)) {
        byte read[] = new byte[1];
        outerLoop:
        while (-1 < file.read(read, 0, 1)) {
            position++;
            if (read[0] == searchString[0]) {
                int matches = 1;
                for (int i = 1; i < searchString.length; i++) {
                    // Stop completely if the file ends in the middle of a potential match.
                    if (file.read(read, 0, 1) == -1) {
                        break outerLoop;
                    }
                    position++;
                    if (read[0] == searchString[i]) {
                        matches++;
                    } else {
                        break;
                    }
                }
                if (matches == searchString.length) {
                    System.out.println((++count) + ". found at position " + (position - matches));
                }
            }
        }
        file.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
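A practical side note that is not part of the answer above: single-byte read() calls on a bare FileInputStream are slow, because each one goes down to the operating system. Wrapping the stream in a BufferedInputStream keeps memory usage at one fixed buffer while reading the disk in larger chunks. A rough sketch of the same search (file name and search word taken from the question, class name made up):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedSearch {

    public static void main(String[] args) throws IOException {
        byte[] searchString = { 'f', 'i', 'r', 's', 't' };
        long position = 0;
        int count = 0;
        int matched = 0;
        // The BufferedInputStream reads the disk in large chunks but still hands
        // out one byte at a time, so memory use stays at one fixed-size buffer.
        try (InputStream in = new BufferedInputStream(new FileInputStream("files.txt"))) {
            int b;
            while ((b = in.read()) != -1) {
                position++;
                if (b == searchString[matched]) {
                    matched++;
                } else {
                    // Restart, retrying the current byte as a possible first byte.
                    matched = (b == searchString[0]) ? 1 : 0;
                }
                if (matched == searchString.length) {
                    // Prints the 0-based offset where the match starts.
                    System.out.println((++count) + ". found at position " + (position - matched));
                    matched = 0;
                }
            }
        }
        // Note: like the loop above, this is not a full substring matcher; a word
        // whose prefix repeats (e.g. "aab") can be missed after a partial match.
    }
}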

Read large file error "outofmemoryerror"(java)

Sorry for my English. I want to read a large file, but when I read it, an OutOfMemoryError occurs. I do not understand how to work with memory in the application. The following code does not work:
try {
    StringBuilder fileData = new StringBuilder(1000);
    BufferedReader reader = new BufferedReader(new FileReader(file));
    char[] buf = new char[8192];
    int bytesread = 0,
        bytesBuffered = 0;
    while ((bytesread = reader.read(buf)) > -1) {
        String readData = String.valueOf(buf, 0, bytesread);
        bytesBuffered += bytesread;
        fileData.append(readData); // this is error
        if (bytesBuffered > 1024 * 1024) {
            bytesBuffered = 0;
        }
    }
    System.out.println(fileData.toString().toCharArray());
} finally {
}
You need to pre-allocate a large buffer to avoid reallocation:
File file = ...;
StringBuilder fileData = new StringBuilder((int) file.length());
And run with a larger heap size:
java -Xmx2G
==== update
A while loop over a small buffer doesn't need much memory to run. Treat the input like a stream and match your search string against the stream; it's a really simple state machine. If you need to search for multiple words, you can find a TrieTree implementation (with stream support) for that.
// the match state model:
// stream:   ...xxxxxxabxxxxxaxxxxxabcdexxxx...
// matches:           ab     a     abcd
File file = new File("path_to_your_file");
String yourSearchWord = "abcd";
int matchIndex = 0;
boolean matchPrefix = false;
try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
    int chr;
    while ((chr = reader.read()) != -1) {
        if (matchPrefix == false) {
            char searchChar = yourSearchWord.charAt(0);
            if (chr == searchChar) {
                matchPrefix = true;
                matchIndex = 0;
            }
        } else {
            char searchChar = yourSearchWord.charAt(++matchIndex);
            if (chr == searchChar) {
                if (matchIndex == yourSearchWord.length() - 1) {
                    // match!!
                    System.out.println("match: " + matchIndex);
                    matchPrefix = false;
                    matchIndex = 0;
                }
            } else {
                matchPrefix = false;
                matchIndex = 0;
            }
        }
    }
}
Try this. This might be helpful:
try {
    BufferedReader reader = new BufferedReader(new FileReader(file));
    String txt = "";
    while ((txt = reader.readLine()) != null) {
        System.out.println(txt);
    }
} catch (Exception e) {
    System.out.println("Error : " + e.getMessage());
}
You should not hold such big files in memory, because you run out of it, as you can see. Since you use Java 7, you need to read the file manually as a stream and check the content on the fly. Otherwise you could use the stream API of Java 8 (see the sketch after the code below). This is just an example. It works, but keep in mind that the position of the found word could vary due to encoding issues, so this is not production code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class FileReader
{
    private static String wordToFind = "SEARCHED_WORD";
    private static File file = new File("YOUR_FILE");
    private static int currentMatchingPosition;
    private static int foundAtPosition = -1;
    private static int charsRead;

    public static void main(String[] args) throws IOException
    {
        try (FileInputStream fis = new FileInputStream(file))
        {
            System.out.println("Total size to read (in bytes) : " + fis.available());
            int c;
            while ((c = fis.read()) != -1)
            {
                charsRead++;
                checkContent(c);
            }
            if (foundAtPosition > -1)
            {
                System.out.println("Found word at position: " + (foundAtPosition - wordToFind.length()));
            }
            else
            {
                System.out.println("Didn't find the word!");
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    private static void checkContent(int c)
    {
        if (currentMatchingPosition >= wordToFind.length())
        {
            // already found....
            return;
        }
        if (wordToFind.charAt(currentMatchingPosition) == (char) c)
        {
            foundAtPosition = charsRead;
            currentMatchingPosition++;
        }
        else
        {
            currentMatchingPosition = 0;
            foundAtPosition = -1;
        }
    }
}
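For completeness, the Java 8 stream variant mentioned above could look roughly like this (my sketch, not the answer's code). Files.lines reads the file lazily, so only one line is held in memory at a time, though a single enormous line would still be a problem:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class StreamSearch {

    public static void main(String[] args) throws IOException {
        String wordToFind = "SEARCHED_WORD";
        try (Stream<String> lines = Files.lines(Paths.get("YOUR_FILE"), StandardCharsets.UTF_8)) {
            // Lines are streamed lazily; matching lines are printed as they are read.
            lines.filter(line -> line.contains(wordToFind))
                 .forEach(System.out::println);
        }
    }
}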

Removing ASCII characters in a string with encoding

I have a byte array which is filled by a serial port event and code is shown below:
private InputStream input = null;
......
......
public void SerialEvent(SerialEvent se) {
    if (se.getEventType == SerialPortEvent.DATA_AVAILABLE) {
        int length = input.available();
        if (length > 0) {
            byte[] array = new byte[length];
            int numBytes = input.read(array);
            String text = new String(array);
        }
    }
}
The variable text contains the escape sequences below,
"\033[K", "\033[m", "\033[H2J", "\033[6;1H" ,"\033[?12l", "\033[?25h", "\033[5i", "\033[4i", "\033i" and similar types..
As of now, I use String.replace to remove all these characters from the string.
I have tried new String(array, charset) with every Charset option, but I couldn't remove them.
Is there any way I can remove those characters without using the replace method?
I gave an unsatisfying answer at first; thanks to @OlegEstekhin for pointing that out.
As no one else has answered yet, and a solution is not a two-liner, here it goes.
Make a wrapping InputStream that throws away escape sequences. I have used a PushbackInputStream, so that a skipped partial sequence may still be pushed back and read first. A FilterInputStream would also suffice here.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Pattern;

public class EscapeRemovingInputStream extends PushbackInputStream {

    public static void main(String[] args) {
        String s = "\u001B[kHello \u001B[H12JWorld!";
        byte[] buf = s.getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayInputStream bais = new ByteArrayInputStream(buf);
        EscapeRemovingInputStream bin = new EscapeRemovingInputStream(bais);
        try (InputStreamReader in = new InputStreamReader(bin,
                StandardCharsets.ISO_8859_1)) {
            int c;
            while ((c = in.read()) != -1) {
                System.out.print((char) c);
            }
            System.out.println();
        } catch (IOException ex) {
            Logger.getLogger(EscapeRemovingInputStream.class.getName()).log(
                    Level.SEVERE, null, ex);
        }
    }

    private static final Pattern ESCAPE_PATTERN = Pattern.compile(
            "\u001B\\[(k|m|H\\d+J|\\d+:\\d+H|\\?\\d+\\w|\\d*i)");
    private static final int MAX_ESCAPE_LENGTH = 20;
    private final byte[] escapeSequence = new byte[MAX_ESCAPE_LENGTH];
    private int escapeLength = 0;
    private boolean eof = false;

    public EscapeRemovingInputStream(InputStream in) {
        this(in, MAX_ESCAPE_LENGTH);
    }

    // Two-argument constructor delegating to PushbackInputStream
    // (omitted in the original snippet).
    public EscapeRemovingInputStream(InputStream in, int pushbackBufferSize) {
        super(in, pushbackBufferSize);
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        for (int i = 0; i < len; ++i) {
            int c = read();
            if (c == -1) {
                return i == 0 ? -1 : i;
            }
            b[off + i] = (byte) c;
        }
        return len;
    }

    @Override
    public int read() throws IOException {
        int c = eof ? -1 : super.read();
        if (c == -1) { // Throw away a trailing half escape sequence.
            eof = true;
            return c;
        }
        if (escapeLength == 0 && c != 0x1B) {
            return c;
        } else {
            escapeSequence[escapeLength] = (byte) c;
            ++escapeLength;
            String esc = new String(escapeSequence, 0, escapeLength,
                    StandardCharsets.ISO_8859_1);
            if (ESCAPE_PATTERN.matcher(esc).matches()) {
                escapeLength = 0;
            } else if (escapeLength == MAX_ESCAPE_LENGTH) {
                escapeLength = 0;
                unread(escapeSequence);
                return super.read(); // No longer registering the escape
            }
            return read();
        }
    }
}
The user calls EscapeRemovingInputStream.read; this read may itself call several reads to fill the byte buffer escapeSequence (a push-back may be done by calling unread), and then the original read returns.
The recognition of an escape sequence seems grammatical: command letter, numerical argument(s). Hence I use a regular expression.

Counting the number of characters from a text file

I currently have the following code:
public class Count {

    public static void countChar() throws FileNotFoundException {
        Scanner scannerFile = null;
        try {
            scannerFile = new Scanner(new File("file"));
        } catch (FileNotFoundException e) {
        }
        int starNumber = 0; // number of *'s
        while (scannerFile.hasNext()) {
            String character = scannerFile.next();
            int index = 0;
            char star = '*';
            while (index < character.length()) {
                if (character.charAt(index) == star) {
                    starNumber++;
                }
                index++;
            }
            System.out.println(starNumber);
        }
    }
}
I'm trying to find out how many times a * occurs in a text file. For example, given a text file containing
Hi * My * name *
the method should return 3.
Currently, with the above example, the method prints:
0
1
1
2
2
3
Thanks in advance.
Use Apache commons-io to read the file into a String
String org.apache.commons.io.FileUtils.readFileToString(File file);
And then, use Apache commons-lang to count the matches of *:
int org.apache.commons.lang.StringUtils.countMatches(String str, String sub)
Result:
int count = StringUtils.countMatches(FileUtils.readFileToString(file), "*");
http://commons.apache.org/io/
http://commons.apache.org/lang/
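Put together, a minimal runnable version might look like this (a sketch assuming commons-io and commons-lang are on the classpath; the file name is a placeholder):
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;

public class CommonsStarCount {

    public static void main(String[] args) throws IOException {
        File file = new File("file.txt");
        // Read the whole file into one String, then count occurrences of "*".
        int count = StringUtils.countMatches(FileUtils.readFileToString(file), "*");
        System.out.println(count);
    }
}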
Everything in your method works fine, except that you print the running count once per token instead of once at the end:
while (scannerFile.hasNext()) {
    String character = scannerFile.next();
    int index = 0;
    char star = '*';
    while (index < character.length()) {
        if (character.charAt(index) == star) {
            starNumber++;
        }
        index++;
    }
    /* PRINTS the running total for each token!!! */
    System.out.println(starNumber);
}
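A sketch of the same idea with the print moved out of the loop, so there is one total for the whole file (the class and method names here are made up; it reuses the question's Scanner approach):
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class StarCounter {

    public static int countStars(File file) throws FileNotFoundException {
        int starNumber = 0;
        try (Scanner scannerFile = new Scanner(file)) {
            while (scannerFile.hasNext()) {
                String token = scannerFile.next();
                for (int index = 0; index < token.length(); index++) {
                    if (token.charAt(index) == '*') {
                        starNumber++;
                    }
                }
            }
        }
        // One total for the whole file, returned (or printed) exactly once.
        return starNumber;
    }
}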
int countStars(String fileName) throws IOException {
    FileReader fileReader = new FileReader(fileName);
    char[] cbuf = new char[1];
    int n = 0;
    // read() returns the number of chars read, or -1 at end of file.
    while (fileReader.read(cbuf) != -1) {
        if (cbuf[0] == '*') {
            n++;
        }
    }
    fileReader.close();
    return n;
}
I would stick to the core Java libraries at this point, and bring in other libraries (such as the Commons libraries) as you become more familiar with the core Java API. This is off the top of my head and might need to be tweaked to run.
StringBuilder sb = new StringBuilder();
FileReader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
String s = br.readLine();
while (s != null) {
    sb.append(s);
    s = br.readLine();
}
br.close(); // this closes the underlying reader so no need for fr.close()
String fileAsStr = sb.toString();
int count = 0;
int idx = fileAsStr.indexOf('*');
while (idx > -1) {
    count++;
    idx = fileAsStr.indexOf('*', idx + 1);
}
