I have a dictionary file that is used for word matching. The Java code is to be submitted online and executed there (for an online coding competition).
How can I use the dictionary data file while my program executes online?
Could it be embedded in the source code as a compressed byte stream?
Please suggest.
There are multiple ways to achieve this:
Either refer to the dictionary file as a remote resource in your code. This means that you'll host your dictionary file at a different online location that is known to your application code. You can then download the dictionary file and cache it in memory for use.
Or encode the dictionary file (for instance in Base64, to take care of special characters in the dictionary file) as a predefined data structure or buffer in your code. This means, however, that you need to convert the dictionary file and rebuild your application each time you change the dictionary.
Pointing to a different "online" location would seem to be the more suitable solution.
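For the embedded variant the question asks about, a minimal sketch could gzip the word list and Base64-encode the result, so that it can be pasted into the source as a string constant. The class name EmbeddedDictionary and the one-word-per-line layout are assumptions for illustration, and java.util.Base64 requires Java 8 or later.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: not part of any library, names chosen for illustration.
class EmbeddedDictionary {

    // Run once on your own machine: gzip the words (one per line) and
    // Base64-encode the result so it can be pasted into a string constant.
    public static String encode(Collection<String> words) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(bytes), StandardCharsets.UTF_8)) {
            for (String word : words) {
                writer.write(word);
                writer.write('\n');
            }
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Run inside the submitted program: reverse the two steps and
    // collect the words into a set for fast lookups.
    public static Set<String> decode(String embedded) throws IOException {
        byte[] compressed = Base64.getDecoder().decode(embedded);
        Set<String> words = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    words.add(line);
                }
            }
        }
        return words;
    }
}
```

Keep in mind that a single Java string constant is limited to 65535 bytes in the class file, so a large dictionary would have to be split across several constants and concatenated at runtime. Whether the competition's source size limit allows this is something to check first.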
Related
I am trying to create a dictionary-based tagger running on a Hadoop cluster using Pig. Basically, for each document (quite large text documents, up to a few MB), it runs each word in each sentence against the dictionary to read the corresponding value.
There will be up to a few hundred Java programs (not threads) running in parallel, using the dictionary file in read-only mode. The idea is to load the dictionary from text and create a Map to query against it.
Question: what should I be prepared for? Is it even remotely logical to want to read a file in a multiprogramming environment, or should I first copy the (relatively small) file for each instance of the program? Is a BufferedReader something I should use while reading the file?
There is very little structured documentation on multiprogramming (compared to multithreading), so I am a bit afraid of running into a wall by doing so.
Note: you are only allowed to answer that my way of thinking is totally wrong if you provide me with a better way ;-)
I think your approach is fine. You should load your dictionary from the DistributedCache to memory, and do the checks with the memory-loaded dictionary (e.g., a HashMap).
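A minimal sketch of the in-memory part, assuming a tab-separated "word<TAB>value" text format (the question does not state the actual layout) and a hypothetical DictionaryLoader class. How the file reaches each task, for example through the DistributedCache, is left out; the loader only needs a Reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical loader; the tab-separated layout is an assumption.
class DictionaryLoader {

    // Reads "word<TAB>value" lines into a map. Each process builds its own
    // private copy, so read-only sharing of the source file is not an issue.
    public static Map<String, String> load(Reader source) throws IOException {
        Map<String, String> dictionary = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab <= 0) {
                    continue; // skip blank or malformed lines
                }
                dictionary.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return dictionary;
    }
}
```

Typical use would be to call load once per task with a FileReader on the local copy of the file and then query the returned map for every word.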
I'm trying to understand how to randomly access a file or files in a .tar.gz using TrueZIP in a Java 6 environment (using the Files classes). I found examples that use Java 7's Path, but I can't come up with an example of how to randomly read an archive on Java 6.
Additionally, does "random" reading mean that it first uncompresses the entire archive, or does it read sections of the compressed file? The purpose is that I want to retrieve some basic information from a file (e.g. a username) without having to uncompress the entire archive just to read it.
The method that gzip uses to compress a file (especially .tar.gz files) usually implies that the output file is not random-accessible - you need the symbol table and other context from the entire file up to the current block to even be able to uncompress that block to see what's in it. This is one of the ways it achieves (somewhat) better compression over ZIP/pkzip, which compress each file individually before adding them to a container archive, resulting in the ability to seek to a specific file and uncompress just that file.
So, in order to pick a .tar.gz apart, you will need to uncompress the whole thing, either to a temporary file or in memory (if it's not too large), then you can jump to specific entries in the underlying .tar file, although that has to be done sequentially by skipping from header to header, as tar does not include a central index/directory of files.
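The header-to-header walk can be sketched with the plain JDK instead of TrueZIP (this is a swapped-in technique, not the TrueZIP API). The sketch assumes the classic tar layout of 512-byte blocks with the entry name at offset 0 and the octal size at offset 124, and it ignores extensions such as long names or base-256 sizes. TarGzLister is a hypothetical name, and the code sticks to Java 6 syntax.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper using only the JDK; written in Java 6 style.
class TarGzLister {
    private static final int BLOCK = 512;

    // Lists entry names by reading each header and skipping over the data.
    // The gzip stream is still decompressed sequentially while skipping.
    public static List<String> listEntries(InputStream tarGz) throws IOException {
        List<String> names = new ArrayList<String>();
        DataInputStream in = new DataInputStream(new GZIPInputStream(tarGz));
        try {
            byte[] header = new byte[BLOCK];
            while (true) {
                try {
                    in.readFully(header);
                } catch (EOFException endOfStream) {
                    break;
                }
                if (header[0] == 0) {
                    break; // an all-zero block marks the end of the archive
                }
                names.add(field(header, 0, 100));
                long size = Long.parseLong(field(header, 124, 12).trim(), 8);
                // entry data is padded to a multiple of the block size
                skipFully(in, (size + BLOCK - 1) / BLOCK * BLOCK);
            }
        } finally {
            in.close();
        }
        return names;
    }

    // Extracts a NUL-terminated ASCII field from the header.
    private static String field(byte[] block, int offset, int length) throws IOException {
        int end = offset;
        while (end < offset + length && block[end] != 0) {
            end++;
        }
        return new String(block, offset, end - offset, "US-ASCII");
    }

    private static void skipFully(DataInputStream in, long count) throws IOException {
        while (count > 0) {
            int skipped = in.skipBytes((int) Math.min(count, Integer.MAX_VALUE));
            if (skipped > 0) {
                count -= skipped;
            } else if (in.read() < 0) {
                throw new EOFException("archive truncated");
            } else {
                count--;
            }
        }
    }
}
```

Note that nothing is written to disk here, but every compressed byte up to the last header is still inflated, which matches the point above that gzip itself is not random-accessible.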
I am not aware of TrueZip in particular, but at least in terms of Zip, RAR and Tar you can access single files and retrieve details about them and even extract them without touching the rest of the package.
Additionally, does "random" reading mean that it first uncompresses
the entire archive
If TrueZip follows Zip/RAR/Tar format, then it does not uncompress the entire archive.
The purpose is that I want to retrieve some basic information from the
file without having to uncompress the entire thing just to read it(ie
username).
As previously, that should be fine -- I don't know TrueZip API in particular, but file container formats allow you to inspect file info without reading a single bit of the data, and optionally extract/read the file contents without touching any other file in the container.
The source code comment of zran describes how such tools work:
http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c
In conclusion, the complete file has to be processed once to generate the necessary index.
That is much faster than actually decompressing everything.
The index splits the file into blocks that can each be decompressed without first decompressing the preceding blocks. This is what is used to emulate random access.
My app downloads files from a server. There could be lots of those files, and one file is about 100 MB. I need to store them safely inside my app.
First I tried to encrypt the files. However, this is a bad solution, because encrypting and decrypting a 100 MB file (it's a PDF) takes some time. I also need to read the file, so I have to decrypt it and write the decrypted content to another file, and while it is being read that file is accessible.
Furthermore, I can't keep the file in memory because of its size. So maybe there is a way to encrypt the directory in internal storage where the file is saved? Or is this not a good idea, since I would then have to encrypt every file in the directory?
As my files are PDFs, I could put a password on them, but how would I do this? I could also try to check whether the device is rooted, but I think someone would find a workaround.
So what would you suggest?
Thanks
It seems like you have 3 options: to encrypt your data; to store the pdfs in a private folder; or to not store the files on-device.
1) Encrypt your data: As you've said, there are disadvantages: the PDFs are quite big, and if you can't hold them in memory you need to write the decrypted files to disk anyway before displaying them, so this doesn't really solve your problem.
2) Store the pdfs in a private folder: Alternatively you could store the pdfs in a private folder only accessible through your app. This can be done using
FileOutputStream fos = openFileOutput(FILENAME, Context.MODE_PRIVATE);
as noted here. "MODE_PRIVATE will create the file (or replace a file of the same name) and make it private to your application". The only problem I see with this is if people are using rooted phones and can access your app's private folders. The only way around this (as far as I know) is to use option 3.
3) Don't store the files on device: You could download the data, or parts of it, each time. This will guarantee that people can't copy the files because they never persist on the device. You could use Google Docs to stream only portions of the document to reduce download requirements if you want. The problem with this is the huge data requirement.
I think you need to weigh up the pros and cons and decide which is best for you. I'd personally go with option 2. I don't think you'll find a solution that addresses all the problems.
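If you do want to try option 1 despite its drawbacks, the memory concern at least can be handled by encrypting and decrypting in a streaming fashion, so that only a small buffer is held at any time. The following is a rough sketch using the standard javax.crypto classes with AES in CTR mode. The class name StreamingCrypto is hypothetical, key storage is not addressed at all, CTR mode provides no integrity protection, and whether your PDF viewer can read from a stream rather than a file path is an assumption you would have to verify.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper; key management is deliberately left out.
class StreamingCrypto {
    private static final String TRANSFORMATION = "AES/CTR/NoPadding";

    // Wraps a stream so that everything written to it is encrypted.
    // key must be 16, 24 or 32 bytes; iv must be 16 bytes and unique per file.
    public static OutputStream encrypting(OutputStream out, byte[] key, byte[] iv)
            throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return new CipherOutputStream(out, cipher);
    }

    // Wraps a stream so that everything read from it is decrypted on the fly.
    public static InputStream decrypting(InputStream in, byte[] key, byte[] iv)
            throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return new CipherInputStream(in, cipher);
    }

    // Copies in 8 KB chunks, so a 100 MB file never sits in memory at once.
    public static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
    }
}
```

On a rooted device the key is still the weak point: if it is stored alongside the files, this adds little over option 2.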
I'm working on a Java project and I have to read some files like these:
- EntryID.data
- EntryID.index
- KeyText.data
- KeyText.index
...
I think these files are used in a dictionary project, but I can't find any documentation about them. How can I read them or find out their format? Sorry for my English.
Thanks a lot!
This looks like files from a database management system. One file to store the data, another one to store at least one index to speed up queries.
I'd start with a hex editor and look at the file. Sometimes the binary content gives a hint.
Another idea: look at the classpath and inspect property and resource files. Maybe you'll find a database driver or some config files with jdbc connect strings.
Google told me that all four files are used by Apple's Dictionary.app. Have a look at this blog; it can point you in the right direction.
Last note: reading undocumented binaries is a challenge. I usually start with 010 Editor to analyse the data structure and develop a Java-based test tool to read the data. It's a sort of trial-and-error, evolutionary process.
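As a starting point for such a Java-based test tool, a tiny hex dump helper is often enough to spot magic numbers and embedded strings. This is only a sketch, and the class name HexPeek is made up for illustration.

```java
// Hypothetical helper for eyeballing the first bytes of an unknown file.
class HexPeek {

    // Formats up to 'limit' bytes as rows of 16: offset, hex values, ASCII.
    public static String dump(byte[] data, int limit) {
        StringBuilder out = new StringBuilder();
        int count = Math.min(limit, data.length);
        for (int row = 0; row < count; row += 16) {
            out.append(String.format("%08x ", row));
            StringBuilder ascii = new StringBuilder();
            for (int i = row; i < row + 16; i++) {
                if (i < count) {
                    int value = data[i] & 0xff;
                    out.append(String.format(" %02x", value));
                    // show printable ASCII as-is, everything else as a dot
                    ascii.append(value >= 0x20 && value < 0x7f ? (char) value : '.');
                } else {
                    out.append("   "); // pad a short final row
                }
            }
            out.append("  ").append(ascii).append('\n');
        }
        return out.toString();
    }
}
```

You would feed it the first few hundred bytes of each file, for example read with a FileInputStream, and compare the output across the .data and .index files.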
Well, this is kinda difficult. data could mean anything.
You could try the UNIX utility file or open the file with a hex editor and look for interesting strings (the utility strings is helpful for that too).
Some information is in info.plist.
KeyText.data is sometimes compressed using zlib. 78 9C is a well-known zlib header, so you can decompress the data when you find it. The size of the decompressed entry comes before the compressed entry.
The size of each entry comes before the entry in the array.
A C# library is at https://github.com/kurema/MacDictionaryGeneral. However, *.index is too difficult to understand and implement. info.plist says *.index is a trie index, which is not enough information to understand it fully.
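The zlib detection can be tried out in Java with java.util.zip.Inflater. The sketch below scans for the 78 9C byte pair and attempts to inflate from that position. ZlibProbe is a hypothetical name, the exact record layout of KeyText.data is not modelled, and the byte pair can also occur by chance inside unrelated data, so a failed inflate just means the search should continue.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical probe; it does not parse the surrounding record structure.
class ZlibProbe {

    // Returns the index of the next 78 9C pair at or after 'from', or -1.
    public static int findZlibHeader(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + 1 < data.length; i++) {
            if ((data[i] & 0xff) == 0x78 && (data[i + 1] & 0xff) == 0x9c) {
                return i;
            }
        }
        return -1;
    }

    // Inflates one zlib stream starting at 'offset' and returns its content.
    public static byte[] inflateAt(byte[] data, int offset) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data, offset, data.length - offset);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int produced = inflater.inflate(buffer);
                if (produced == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported stream");
                }
                out.write(buffer, 0, produced);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }
}
```

Comparing the length of the inflated result with the size value stored just before the compressed entry is a quick way to check that an entry was located correctly.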
I'm developing a desktop application using NetBeans (Java Desktop Application) and I need to implement my own file format that is specific to that application. I'm quite uncertain how to go about it. What code should I use so that my Java application reads that file and opens it the way I want?
If it's character data, use Reader/Writer. If it's binary data, use InputStream/OutputStream. That's it. They are available in several flavors, like BufferedReader, which eases reading a text file line by line, and so on.
They're part of the Java IO API. Start learning it here: Java IO tutorial.
By the way, Java on its own really doesn't care about the file extension or format. It's the code logic you write that has to handle each character or byte of the file according to some file format specification (which you in turn have to write up first if you'd like to invent one yourself).
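To make the idea of a file format specification concrete, here is a small sketch of an invented binary format: a magic number, a version, a count, and then that many strings. Everything about it (the NoteFile name, the magic value, the field order) is made up for illustration, and your real format would depend on what your application stores.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical format: [int magic][int version][int count][count x UTF string]
class NoteFile {
    private static final int MAGIC = 0x4E4F5445; // ASCII bytes "NOTE"
    private static final int VERSION = 1;

    public static void write(OutputStream out, List<String> lines) throws IOException {
        DataOutputStream data = new DataOutputStream(new BufferedOutputStream(out));
        data.writeInt(MAGIC);
        data.writeInt(VERSION);
        data.writeInt(lines.size());
        for (String line : lines) {
            data.writeUTF(line); // limited to 65535 encoded bytes per string
        }
        data.flush();
    }

    public static List<String> read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(new BufferedInputStream(in));
        // The magic number lets the reader reject files of another type early.
        if (data.readInt() != MAGIC) {
            throw new IOException("Not a NOTE file");
        }
        int version = data.readInt();
        if (version != VERSION) {
            throw new IOException("Unsupported version " + version);
        }
        int count = data.readInt();
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            lines.add(data.readUTF());
        }
        return lines;
    }
}
```

Pass a FileOutputStream or FileInputStream for a file with whatever extension you like. The extension is only a naming convention, while the magic number and version field are what let your code recognise and evolve the format.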
I am not sure this directly addresses your question, but since you mentioned a custom file format, it is worth noting that applications launched using Java Web Start can declare a file association. If the user double clicks one of those file types, the file name will be passed to the main(String[]) of the app.
This ability is used in the File Service demo of the JNLP API, available at my site.
As to the exact format of the file & the best ways to load and save it, there are a large number of possibilities that can be narrowed down with more details of the information it contains.
Choosing a new or existing file extension does not affect your application (or anyone else's). It is up to the programmer which files the app reads.
For example, you might think you can't read a PDF or DOC file directly as a text file. That is not because they are written or stored differently, but because they contain headers or characters that your app does not understand. So we might use a plugin or extension that understands those added headers (or rather the grammar of the PDF/DOC file), removes them, and lets our app know what text (or anything else) the file contains.
So if you wish to introduce your own extension, and specifically want no other application to be able to read it, just write the content in a way that only your program understands. Writing a file in binary pretty much ensures that it cannot be read simply by a user opening it, but it is still possible to read from it if it is merely a collection of raw characters.
If you are asking for code to hide data, there are plenty of algorithms you might use, usually labelled as encryption, since you are basically trying to lock or hide your content. So if you do not care for all the fuss, are simply trying to keep the file from being read directly, and a successful attempt to read the file does not cause any harm to your application, write it in binary.