Get random row from very big file with Lucene - java

I have a Spring-based Java webapp. And my problem is:
I have a 34 MB file with 2.7 million lines. The lines are just single words, one after another:
abc
abcdfg
xyz
etc
I need to choose 15 unique random lines from this file, none of which are next to each other, reasonably fast. I know that Apache Lucene can be used to search such big files. Do you know whether Lucene can fetch these random lines for me? Or maybe you have some other idea that could help me solve this problem.
I would really appreciate any help
Thanks in advance
EDIT:
Or maybe I should just put this file into a database [PostgreSQL]?

Lucene would not help you here.
Instead, just generate random line numbers (making sure they are not next to each other) and then read those lines from the text file.
Here is the code that does it:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public static void main(String[] args) throws IOException
{
    BufferedReader reader = new BufferedReader(new FileReader("MyFile.txt"));
    try
    {
        final int MAX_NUM = <ENTER-YOUR-MAX-NUMBER-OF-LINES>;
        Set<Integer> randomLines = new HashSet<Integer>();
        Random rnd = new Random(System.currentTimeMillis());
        // keep drawing until we have 15 distinct line numbers;
        // a plain for-loop of 15 draws could end up with fewer after collisions
        while (randomLines.size() < 15)
        {
            int aNum = rnd.nextInt(MAX_NUM);
            // to make sure no lines are next to each other...
            if (!randomLines.contains(aNum) && !randomLines.contains(aNum + 1)
                    && !randomLines.contains(aNum - 1))
            {
                randomLines.add(aNum);
            }
        }
        List<String> result = new ArrayList<String>();
        String aLine;
        int lineNo = 0;
        while ((aLine = reader.readLine()) != null)
        {
            if (randomLines.contains(lineNo))
            {
                result.add(aLine);
            }
            lineNo++;
        }
        System.out.println("Result: " + result);
    }
    finally
    {
        reader.close();
    }
}

I would suggest using MongoDB (it is not as reliable as an RDBMS, but it is extremely quick).
http://www.mongodb.org/display/DOCS/Quickstart
I would parse the text file into Mongo documents and then retrieve 3 random docs from the Mongo DB, which would give you 3 random phrases.
1) In Java, read the text file and save each line as a separate doc in Mongo, or execute commands like these in Mongo directly:
> doc = { phrase : 'uniquephrase'}
> db.posts.insert(doc);
2) In your Java code, connect to Mongo, get the collection size, pick 3 random numbers from that range, and then serve the 3 docs... (or anything else)
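A rough sketch of step 2, assuming the modern MongoDB Java driver; the connection string and database name are assumptions, and the "phrase" field matches the commands above:
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.Random;

public class RandomPhrases
{
    public static void main(String[] args)
    {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> posts = client.getDatabase("test").getCollection("posts");
        long count = posts.countDocuments(); // assumes the collection is non-empty
        Random rnd = new Random();
        for (int i = 0; i < 3; i++)
        {
            // skip to a random position and take one document
            // (draws may repeat; re-draw if you need distinct phrases)
            int skip = rnd.nextInt((int) count);
            Document doc = posts.find().skip(skip).limit(1).first();
            System.out.println(doc.getString("phrase"));
        }
        client.close();
    }
}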

Related

Converting part of .docx document to html using Apache POI

I use XHTMLConverter to convert .docx to html, to make a preview of the document. Is there any way to convert only a few pages from the original document? I'll be grateful for any help.
You have to parse the complete .docx file. It is not possible to read just parts of it. As for selecting a specific page number, I'm afraid (at least I believe) that Word does not store page numbers, so there is no function in the library to access a specific page.
(I've read this in another forum; it actually might be false information.)
PS: the Excel side of POI contains a .getSheetAt() method (this might help you in your research).
But there are also other ways to access your pages. For instance, you could read the lines of your docx document and search for the page numbers (this might break if your text contains those numbers, though). Another way would be to search for the header of the page, which would be more accurate:
HeaderStories headerStore = new HeaderStories(doc);
String header = headerStore.getHeader(pageNumber);
This should give you the header of the specified page. The same goes for the footer:
HeaderStories headerStore = new HeaderStories(doc);
String footer = headerStore.getFooter(pageNumber);
If this doesn't work... I am not really into that API.
Here is a little example of a very sloppy solution:
import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadDocFile
{
    public static void main(String[] args)
    {
        try
        {
            File file = new File("c:\\New.doc");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            HWPFDocument document = new HWPFDocument(fis);
            WordExtractor extractor = new WordExtractor(document);
            String[] fileData = extractor.getParagraphText();
            for (int i = 0; i < fileData.length; i++)
            {
                if (fileData[i].equals("headerPageOne"))
                {
                    int firstLineOfPageOne = i;
                }
                if (fileData[i].equals("headerPageTwo"))
                {
                    int lastLineOfPageOne = i;
                }
            }
        }
        catch (Exception exep)
        {
            exep.printStackTrace();
        }
    }
}
If you go with this, I would recommend you create a String[] with your headers and refactor the for-loop into a separate getPages() method. Your loop would then look like:
String[] headerArray = { "header1", "header2", "header3", "header4" };
for (int i = 0; i < fileData.length; i++)
{
    // well, there should be a loop for "x" too
    if (fileData[i].equals(headerArray[x]))
    {
        int firstLineOfPageOne = i;
    }
    if (fileData[i].equals(headerArray[x + 1]))
    {
        int lastLineOfPageOne = i;
    }
}
You could create an object (int pageStart, int pageStop), which would be the product of your method.
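For illustration, a minimal sketch of such a result object; the class and method names are my own invention, not anything from the POI API:
// Hypothetical value object holding the paragraph range of one page.
public class PageRange
{
    private final int pageStart;
    private final int pageStop;

    public PageRange(int pageStart, int pageStop)
    {
        this.pageStart = pageStart;
        this.pageStop = pageStop;
    }

    public int getPageStart() { return pageStart; }
    public int getPageStop()  { return pageStop; }
}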
I hope it helped you :)

Populating ArrayList from file without including formatting info and back-slashes

To start, no, this is not a homework assignment. I am fresh out of high school and am trying to do some personal projects before college. I've been trying to populate an ArrayList with elements from a document. The document looks like:
item1
item2
item3
...
itemN
After failing many times on my own, I tried different solutions from this website. Most recently, this one got me the closest to what I want:
public static void main(String[] args) throws IOException {
    List<String> names = new ArrayList<String>();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new FileReader("/Users/MyName/Desktop/names.txt"));
        String line = null;
        while ((line = reader.readLine()) != null) {
            names.add(line);
        }
    } finally {
        if (reader != null) { // avoid a NullPointerException if the FileReader constructor threw
            reader.close();
        }
    }
    for (int i = 0; i < names.size(); i++) {
        System.out.println(names.get(i));
    }
    //String[] array = (String[]) names.toArray(); Not necessary that it is in an array
}
The only problem is that this prints something rather ugly to the console:
{\rtf1\ansi\ansicpg1252\cocoartf1347\cocoasubrtf570
{\fonttbl\f0\froman\fcharset0 Times-Roman;}
{\colortbl;\red255\green255\blue255;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx577\tx1155\tx1733\tx2311\tx2889\tx3467\tx4045\tx4623\tx5201\tx5779\tx6357\tx6935\tx7513\tx8091\tx8669\tx9247\tx9825\tx10403\tx10981\tx11559\tx12137\tx12715\tx13293\tx13871\tx14449\tx15027\tx15605\tx16183\tx16761\tx17339\tx17917\tx18495\tx19072\tx19650\tx20228\tx20806\tx21384\tx21962\tx22540\tx23118\tx23696\tx24274\tx24852\tx25430\tx26008\tx26586\tx27164\tx27742\tx28320\tx28898\tx29476\tx30054\tx30632\tx31210\tx31788\tx32366\tx32944\tx33522\tx34100\tx34678\tx35256\tx35834\tx36412\tx36990\tx37567\tx38145\tx38723\tx39301\tx39879\tx40457\tx41035\tx41613\tx42191\tx42769\tx43347\tx43925\tx44503\tx45081\tx45659\tx46237\tx46815\tx47393\tx47971\tx48549\tx49127\tx49705\tx50283\tx50861\tx51439\tx52017\tx52595\tx53173\tx53751\tx54329\tx54907\tx55485\tx56062\tx56640\tx57218\tx57796\li577\fi-578
\f0\fs24 \cf0 \CocoaLigature0 item1\
item2\
item3\
...
itemN\
}
How can I get it to read from the file without including all of the back-slashes and formatting info?
You just need to actually save the file as a plain text file; your file looks like an RTF file at the moment. Open the file in the Pages application, go to File... Export To... Plain Text... and save it as a new file.
Looks like your names.txt file got saved as RTF (Rich Text Format). Make sure you convert it to plain text.

What is a good way to load many pictures and their reference in an array? - Java + ImageJ

I have, for example, 1000 images whose names are all very similar, differing only in the number: "ImageNmbr0001", "ImageNmbr0002", ..., "ImageNmbr1000", etc.
I would like to load every image and store it in an ImageProcessor array.
Then, if I call a method on an element of this array, the method is applied to that picture, for example counting the black pixels in it.
I could use a for loop to get the numbers from 1 to 1000, turn them into strings, and splice them into the file name to load each image.
However, I would still have to turn the result into an element I can store in an array, and I don't have a method yet that receives a string (the file path) and returns the ImageProcessor stored at its end.
Also, my approach at the moment seems rather clumsy and not too elegant. So I would be very happy if someone could show me a better way to do that using methods from these packages:
import ij.ImagePlus;
import ij.plugin.filter.PlugInFilter;
import ij.process.ImageProcessor;
I think I found a solution:
Opener opener = new Opener(); // from ij.io.Opener
String imageFilePath = "somePath";
ImagePlus imp = opener.openImage(imageFilePath);
ImageProcessor ip = imp.getProcessor();
That does the job, but thank you for your time/effort.
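To load all 1000 images into an array, a sketch along these lines should work; the directory, the %04d zero-padding, and the .tif extension are assumptions about the actual file names:
import ij.ImagePlus;
import ij.io.Opener;
import ij.process.ImageProcessor;

public class LoadImages {
    public static void main(String[] args) {
        Opener opener = new Opener();
        ImageProcessor[] processors = new ImageProcessor[1000];
        for (int i = 1; i <= 1000; i++) {
            // builds "ImageNmbr0001.tif", "ImageNmbr0002.tif", ... (assumed naming)
            String path = String.format("somePath/ImageNmbr%04d.tif", i);
            ImagePlus imp = opener.openImage(path);
            if (imp != null) { // openImage returns null when the file cannot be opened
                processors[i - 1] = imp.getProcessor();
            }
        }
    }
}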
I'm not sure if I understand what you want exactly... but I definitely would not save the information for each image in a separate file, for 2 reasons:
- It's slower to save and read the content of multiple files compared with one medium-size file
- Each file adds overhead (files need a path, a minimum size on disk, etc.)
If you want performance, group multiple image descriptions into a single description file.
If you don't want to make a binary description file, you can always use a database, which is built for this: it performs well on reads and normally on saves.
I don't know exactly what your needs are, but I guess you can try making a binary file with fixed-size data and reading it later.
Example:
public static void main(String[] args) throws IOException {
    FileOutputStream fout = null;
    FileInputStream fin = null;
    try {
        fout = new FileOutputStream("description.bin");
        DataOutputStream dout = new DataOutputStream(fout);
        for (int x = 0; x < 1000; x++) {
            dout.writeInt(10); // Write int data
        }
        dout.flush(); // make sure everything is on disk before reading it back
        fin = new FileInputStream("description.bin");
        DataInputStream din = new DataInputStream(fin);
        for (int x = 0; x < 1000; x++) {
            System.out.println(din.readInt()); // Read int data
        }
    } catch (Exception e) {
        e.printStackTrace(); // don't swallow exceptions silently
    } finally {
        if (fout != null) {
            fout.close();
        }
        if (fin != null) {
            fin.close();
        }
    }
}
In this example, the code writes integers to the "description.bin" file and then reads them back.
This is pretty fast in Java, and you can make it faster still by wrapping the streams in BufferedOutputStream/BufferedInputStream or by using NIO channels.

Writing big strings to a text file?

I have strings which look like this -
String text = "item1, item2, item3, item4 etc...";
I wrote Java code to write these strings to a text file, which will be converted to CSV by simply changing the extension. The logic is: print a string, then move to a new line and print another string.
The output in the text file was perfect when the test strings had only 10-20 items.
BUT my real strings have about 3000 unique items each, and there are about 20,000 such strings.
When I write all these strings to the text file, it gets messed up:
I see 3000 rows instead of 20,000 rows.
I think there is no need for code for this problem because it's been done and tested.
I only need to be able to format my data properly.
For those who want to see the code -
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class Texty {
    public static void main(String[] args) {
        System.out.println("start");
        String str = "";
        String enter = System.getProperty("line.separator");
        for (int i = 0; i < 5; i++) {
            str = str + i + ",";
        }
        str = str + 5;
        System.out.println(str);
        FileWriter fw = null;
        File newTextFile = new File("C:\\filez\\output.txt");
        try {
            fw = new FileWriter(newTextFile);
        } catch (IOException e) {
            e.printStackTrace();
        }
        try {
            for (int i = 0; i < 10; i++) {
                fw.write(str + enter);
            }
            fw.close();
        } catch (IOException iox) {
            // do stuff with exception
            iox.printStackTrace();
        }
        System.out.println("stop");
    }
}
You are right that there is no difference between 10 columns and 3000 columns; you just have longer lines.
Also there is no difference between 10 rows and 20,000 rows; you just have more lines.
While you can have much, much larger files in Java or on your file system, some old versions of Excel could not load that many columns (there was a limit of 256 columns) or such large files (a limit of about 1 GB of raw data).
I would check that the file is correct in another program, e.g. one you wrote, and you might find all the data is there.
If the data is not there, you have a bug. There is no limitation in Java, Windows, or Linux that would explain the behaviour you are seeing.
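For example, a minimal sketch of such a check, re-using the output path from the question:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineCounter {
    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader("C:\\filez\\output.txt"));
        int lines = 0;
        while (reader.readLine() != null) {
            lines++; // count every physical line in the file
        }
        reader.close();
        System.out.println("Line count: " + lines); // you would expect 20,000 here
    }
}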

How to read and update row in file with Java

Currently I am creating a Java app, and no database is required;
that is why I am using a text file instead.
The structure of the file is like this:
unique6id username identitynumber point
unique6id username identitynumber point
May I know how I could read the file, find the matching unique6id, and then update the point value in the corresponding row?
Sorry for the lack of information;
here is the code I have typed so far:
public class Cust {
    String name;
    long idenid, uniqueid;
    int pts;

    Cust() {}

    Cust(String n, long ide, long uni, int pt) {
        name = n;
        idenid = ide;
        uniqueid = uni;
        pts = pt;
    }
}

FileWriter fstream = new FileWriter("Data.txt", true);
BufferedWriter fbw = new BufferedWriter(fstream);
Cust newCust = new Cust();
newCust.name = memUNTF.getText();
newCust.idenid = Long.parseLong(memICTF.getText());
newCust.uniqueid = Long.parseLong(memIDTF.getText());
newCust.pts = points;
fbw.write(newCust.name + " " + newCust.idenid + " " + newCust.uniqueid + " " + newCust.pts);
fbw.newLine();
fbw.close();
This is how I write in the data.
The result inside Data.txt is then:
spencerlim 900419129876 448505 0
Eugene 900419081234 586026 0
When the user types in 586026, it should grab Eugene's row,
bind it into a Cust,
and update the pts (0 in this case; I am trying to update it to another number, e.g. 30).
Thx for reply =D
Reading is pretty easy, but updating a text file in place (i.e. without rewriting the whole file) is very awkward.
So, you have two options:
Read the whole file, make your changes, and then write the whole file to disk, overwriting the old version; this is quite easy and will be fast enough for small files, but it is not a good idea for very large files (a sketch of this is shown after these options).
Use a format that is not a simple text file. A database would be one option (and bear in mind that there is one, Derby, built into the JDK); there are other ways of keeping simple key-value stores on disk (like a HashMap, but in a file), but there's nothing built into the JDK.
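Here is the sketch mentioned above for the first option. It assumes the four-field, space-separated format stated in the question (unique6id username identitynumber point); the file name, target id, and new point value are example values:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class UpdatePoints {
    public static void main(String[] args) throws IOException {
        String targetId = "586026"; // example unique6id
        int newPoints = 30;         // example new point value

        // 1) read the whole file into memory
        List<String> lines = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader("Data.txt"));
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
        reader.close();

        // 2) rewrite the file, updating the matching row on the way out
        BufferedWriter writer = new BufferedWriter(new FileWriter("Data.txt"));
        for (String l : lines) {
            String[] f = l.split(" ");
            // assumes the order: unique6id username identitynumber point
            if (f.length == 4 && f[0].equals(targetId)) {
                l = f[0] + " " + f[1] + " " + f[2] + " " + newPoints;
            }
            writer.write(l);
            writer.newLine();
        }
        writer.close();
    }
}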
You can use OpenCSV with custom separators.
Here's a sample method that updates the info for a specified user:
public static void updateUserInfo(
        String userId,   // user id
        String[] values  // new values
) throws IOException {
    String fileName = "yourfile.txt.csv";
    CSVReader reader = new CSVReader(new FileReader(fileName), ' ');
    List<String[]> lines = reader.readAll();
    reader.close();
    for (String[] items : lines) {
        if (items[0].equals(userId)) {
            for (int i = 0; i < values.length; i++) {
                String value = values[i];
                if (value != null) {
                    // for every array value that's not null,
                    // update the corresponding field
                    items[i + 1] = value;
                }
            }
            break;
        }
    }
    CSVWriter writer = new CSVWriter(new FileWriter(fileName), ' ');
    writer.writeAll(lines);
    writer.close(); // flush and release the file
}
Use InputStream(s) and Reader(s) to read the file.
Here is a code snippet that shows how to read a file:
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("c:/myfile.txt")));
String line = null;
while ((line = reader.readLine()) != null) {
    // do something with the line.
}
reader.close();
Use OutputStream(s) and Writer(s) to write to the file. Although you can use random access files, i.e. write to a specific place in the file, I do not recommend doing this. A much easier and more robust way is to create a new file every time you have to write something. I know that this is probably not the most efficient way, but you do not want to use a DB for some reason... If you have to save and update partial information relatively often and perform searches in the file, I'd recommend you do use a DB. There are very lightweight implementations, including pure Java implementations (e.g. H2: http://www.h2database.com/html/main.html).
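For completeness, a minimal sketch of the embedded H2 alternative; the database path, table name, and column names are my own assumptions based on the question's fields:
import java.sql.*;

public class H2Example {
    public static void main(String[] args) throws SQLException {
        // opens (or creates) an embedded database file under the home directory
        Connection conn = DriverManager.getConnection("jdbc:h2:~/custdata", "sa", "");
        Statement st = conn.createStatement();
        st.execute("CREATE TABLE IF NOT EXISTS cust"
                + " (uniqueid BIGINT PRIMARY KEY, name VARCHAR(100),"
                + "  idenid BIGINT, pts INT)");
        // insert-or-update one row (H2's MERGE uses the primary key)
        st.execute("MERGE INTO cust VALUES (586026, 'Eugene', 900419081234, 0)");
        // update the points for a given unique id, as described in the question
        PreparedStatement ps = conn.prepareStatement(
                "UPDATE cust SET pts = ? WHERE uniqueid = ?");
        ps.setInt(1, 30);
        ps.setLong(2, 586026L);
        ps.executeUpdate();
        conn.close();
    }
}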
