I am trying to create a dictionary-based tagger running on a Hadoop cluster using Pig. Basically, for each document (quite large text documents, up to a few MB), it runs each word of each sentence against the dictionary to read the corresponding value.
There will be up to a few hundred Java programs (not threads) running in parallel, using the dictionary file in read-only mode. The idea is to load the dictionary from a text file and build a Map to query against.
Question: what should I be prepared for? Is it even remotely reasonable to read a single file from many processes at once, or should I first copy the (relatively small) file for each instance of the program? Is a BufferedReader something I should use while reading the file?
There is very little structured documentation on multiprogramming (compared to multithreading), so I am a bit afraid of running into a wall by doing this.
Note: you are only allowed to answer that my way of thinking is totally wrong if you provide me with a better way ;-)
I think your approach is fine. You should load your dictionary from the DistributedCache into memory and do the lookups against the in-memory dictionary (e.g., a HashMap).
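A minimal sketch of the loading step, assuming the dictionary has been shipped via the DistributedCache so that it is available as a local file to each task; the path and the tab-separated "word<TAB>value" format are assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class DictionaryLoader {

    // Loads "word<TAB>value" lines into an in-memory map; the format and
    // local path are assumptions about your dictionary file.
    public static Map<String, String> load(String localPath) throws IOException {
        Map<String, String> dict = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(localPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    dict.put(parts[0], parts[1]);
                }
            }
        }
        return dict;
    }
}

Each process (or UDF instance) would call load(...) once, e.g. lazily on first use, and then reuse the map for every lookup.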
Related
I am working on a project where I have extracted images from a sensor and saved them to a directory on the operating system. I have a Java API for uploading images to the server.
I need to upload these images and some other data (typically of float type) to the main server.
I need to decide on an intermediary, such as a database where I store those images and connect through Java to upload them, or HDFS.
Can somebody please advise me which option will be better for storing images: a database or HDFS?
Note: there are up to 150 thousand images, and there can be more in the future.
I think the best way to do that is to keep the floats you need and the metadata of the images in the database, for easier searching and querying and easier interaction with Java. The actual images are best stored on a file system to avoid converting them to and from the database. I believe a simple file system would be good enough for that number of images. You probably won't use any of the fancier HDFS features like MapReduce, but that's up to you.
So, if a standard file system isn't good enough for you and you want something bigger, then HDFS is the way to go. The proper solution would be a mixture of the two.
It totally depends on the use case; you can choose:
HDFS: when you want to read the images as a whole, transfer them, or process them to do some manipulation on the image data and then store or act on the processed results; in short, when you want to do MapReduce operations. Also, reading from HDFS is sequential, so if you want to fetch a particular image based on some selection criteria, that is a costly operation with a real performance impact.
Database: better for query-based operations, where you want to query or do DML operations on the images based on certain criteria; in short, WHERE conditions. But this becomes very time-consuming when you want to process the data as a chunk, and performance will obviously be very slow if you store 150 thousand images.
So my suggestion, given the requirement that you want to store the images as an intermediate step, is that it is better to store them in HDFS itself.
150,000 images is not considered a huge amount today. If an average of 10 MB is assumed for each image (uncompressed), the amount of data is 1.5 TB, which should be possible to store in an off-the-shelf database (on off-the-shelf hardware, i.e. a Linux box with some RAID disks) like PostgreSQL. I'm no expert in HDFS, but having tried products in the same family as HDFS, I find them easy to use, so you could try Hadoop for processing the images as well if you are looking for a way to parallelize the processing. Even though this product family is nice, I would still use a standard database like PostgreSQL if parallelisation is not really needed by nature (as you get it in HDFS).
I have a dictionary file that is used for word matching. The Java code has to be submitted online and is executed there (it is for an online coding competition).
How would I be able to use the dictionary data file while my program executes online?
Could it be embedded in the source code as a compressed byte stream?
Please suggest.
There are multiple ways to achieve this:
Either refer to the dictionary file as a remote resource in your code. This means that you will host your dictionary file at a different online location which is well known to your online application code. You can then download the dictionary file and cache it in memory for usage.
Or you can encode the dictionary file (for instance in Base64 encoding, to take care of special characters in the dictionary file) as a predefined data structure / buffer in your code. This means, however, that you need to convert your dictionary file and rebuild your application each time you change the dictionary file (a sketch of this approach follows below).
Pointing to a different "online" location seems the more suitable solution.
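A minimal sketch of the embedding option, assuming the dictionary is gzip-compressed and Base64-encoded offline and then pasted into the source as a string constant; the class layout below is just an illustration:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class EmbeddedDictionary {

    // Offline step: run this main() with the dictionary path as argument and
    // paste the printed Base64 string into DICTIONARY_B64 below.
    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(raw);
        }
        System.out.println(Base64.getEncoder().encodeToString(compressed.toByteArray()));
    }

    // Placeholder: replace with the generated Base64 string.
    private static final String DICTIONARY_B64 = "";

    // Online step: decode and decompress the embedded dictionary text.
    public static String load() throws IOException {
        byte[] compressed = Base64.getDecoder().decode(DICTIONARY_B64);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream text = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gzip.read(buf)) != -1) {
                text.write(buf, 0, n);
            }
            return new String(text.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}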
I'm writing a Java program and I'd like to convert an Ogg file into an MP3 file.
I've spent a lot of time trying to find a good library to do that, but without success so far.
I think I'll need an Ogg decoder (JOrbis?) and an MP3 encoder (LAMEOnJ?).
Moreover, once the conversion is done, I need to set some tags in the file (artist/track tag, etc).
This is a Windows and OS X app.
Could you give me any hint about how to proceed, with examples if possible?
Thanks
You have lots of choices, and it depends on how much effort you want to put in, and what constraints you have regarding the execution platform.
Many developers would simply use Runtime.exec() / ProcessBuilder calls to external decode/encode/tagging executables, writing the intermediate files to disk. This is slightly clunky, but once it's set up properly, it works.
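A rough sketch of that approach, assuming command-line tools such as oggdec and lame are installed and on the PATH; the tool names and flags here are assumptions, so adjust them to whatever binaries you actually ship:

import java.io.IOException;

public class ExternalConverter {

    // Decodes Ogg to WAV, then encodes WAV to MP3, by shelling out to
    // external tools; tool names and options are assumptions.
    public static void convert(String oggPath, String wavPath, String mp3Path)
            throws IOException, InterruptedException {
        run("oggdec", "-o", wavPath, oggPath);
        run("lame", wavPath, mp3Path);
    }

    private static void run(String... command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()   // forward the tool's output to our console
                .start();
        if (process.waitFor() != 0) {
            throw new IOException("Command failed: " + String.join(" ", command));
        }
    }
}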
A more sophisticated option is to use libraries such as the ones you've found. You can still use the filesystem to temporarily store the uncompressed version.
You can, however, avoid the intermediate file, and maybe make it faster, by pipelining: feed the output of the decoder into the input of the encoder, and set them both going.
The details depend on the API. If you're lucky, the libraries can work with chunks of data, and you may be able to manage them in a single thread.
If they work with streams, you might need to get your hands dirty and work with threads. One thread for the encoder, one for the decoder.
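A skeleton of the two-thread pipeline; decodeOggToPcm and encodePcmToMp3 are hypothetical stand-ins for whatever the decoder and encoder libraries actually expose:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class PipelineConverter {

    public static void convert(String oggPath, String mp3Path) throws Exception {
        PipedOutputStream pcmOut = new PipedOutputStream();
        PipedInputStream pcmIn = new PipedInputStream(pcmOut, 1 << 16);

        // Decoder thread: Ogg file -> raw PCM into the pipe.
        Thread decoder = new Thread(() -> {
            try (InputStream ogg = new FileInputStream(oggPath); OutputStream out = pcmOut) {
                decodeOggToPcm(ogg, out);   // hypothetical, e.g. built on JOrbis
            } catch (Exception e) {
                e.printStackTrace();
            }
        });

        // Encoder thread: raw PCM from the pipe -> MP3 file.
        Thread encoder = new Thread(() -> {
            try (InputStream in = pcmIn; OutputStream mp3 = new FileOutputStream(mp3Path)) {
                encodePcmToMp3(in, mp3);    // hypothetical, e.g. built on LAMEOnJ
            } catch (Exception e) {
                e.printStackTrace();
            }
        });

        decoder.start();
        encoder.start();
        decoder.join();
        encoder.join();
    }

    // Placeholders for the real library calls; signatures are assumptions.
    private static void decodeOggToPcm(InputStream ogg, OutputStream pcm) { /* ... */ }
    private static void encodePcmToMp3(InputStream pcm, OutputStream mp3) { /* ... */ }
}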
Is it possible to have a Java EE application (based on the Spring Framework, running in a Tomcat container) persist its data in a file on the server?
The scenario is as follows: I have a class with an int field (read from ?? during startup). I want to save it to a file in a safe manner (as safe as possible, meaning surviving a server crash would be appreciated). Is that possible (beyond naive file reading/writing)?
Kind regards,
q
Really the only "safe" way to do it is to rely on the underlying file system.
Simply:
public void saveThing(Serializable thing, String fileName) throws Exception {
    // Write to a temporary file first, then rename it over the target.
    File tempFile = new File(fileName + "_tmp");
    FileOutputStream fos = new FileOutputStream(tempFile);
    FileDescriptor fd = fos.getFD();
    ObjectOutputStream oos = new ObjectOutputStream(fos);
    oos.writeObject(thing);
    oos.flush();
    fd.sync();          // force the OS to flush the write to the physical disk
    oos.close();
    if (!tempFile.renameTo(new File(fileName))) {
        throw new IOException("Could not rename " + tempFile + " to " + fileName);
    }
}
What's happening here is first we're writing the file to a temporary file. This ensures that the entire file write succeeds without damaging the original file (for example, if you run out of disk space, the original will be retained as this routine will not finish). However if this routine fails, the lingering temp file will remain, and will need to be cleaned up later.
Once we've written the file, we force the OS to flush any pending writes to the actual disk. Many systems buffer file system writes in RAM and "eventually" write them out to disk, for obvious performance reasons. However, should the system crash or lose power between the time you closed the file and the time the OS decides to flush the writes, you can potentially lose data. Note that this sync is an EXPENSIVE operation.
Finally, once we are sure that we have written the file, and that it is committed to disk (as sure as we can be anyway), we then RENAME the temp file to the actual file name.
Renaming a file on the file system is an atomic operation. It can't partially fail. It either works, or it doesn't. If the two files are on the same file system, the rename is near instantaneous since it simply updates some file system information. If the two are on separate file systems, then the new file must be copied first to the new file system, and then renamed. I ASSUME this is how it is done, I never tested this. I tend to stick to the same file system and avoid the question completely.
This process ensures that the file will be updated, under the correct name, completely, "all at once". The file (under its correct name) never only "partially exists", which is what would happen if you were to simply overwrite the existing file.
Finally, on Windows you may have a problem if there is contention for the original file, since Windows will not delete a file that is opened by something else. Unix has no problem doing this, but Windows does. So you need to ensure through some external means that you have sole access to the file before doing this rename procedure.
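If you are on Java 7 or later, the rename step can also be expressed with java.nio.file, which lets you request an atomic move explicitly. A sketch follows; note that whether the move is really atomic, and whether an existing target is silently replaced under ATOMIC_MOVE, still depends on the underlying file system:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class AtomicRename {

    // Renames the fully written temp file over the target in one atomic step.
    // Replacing an existing target under ATOMIC_MOVE is implementation specific;
    // on typical Unix file systems it works.
    public static void commit(String fileName) throws IOException {
        Path temp = Paths.get(fileName + "_tmp");
        Path target = Paths.get(fileName);
        Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}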
The short answer is yes. I actually had to do just that for a project that I did with a university a while back. I posted the code for it on my GitHub: Speak To Me project. In that web app, I persisted user data to a file in plain text so it was both human readable and easy for objects to reinitialize themselves from.
Readers of this question might be wondering why I didn't use a database for these purposes. Well, the university that I was working with didn't want to support one. As well, this app had really low traffic; it was a research prototype for testing search interfaces, so it was only used for user studies. Finally, because of the nature of the application, persisting to a file kept things really simple. In fact, the data files were later used for post-study analyses. Plus, it kept the option open for students who were not great coders to get their feet wet (that... never happened).
Anyhow, my recommendation is that if you are just persisting simple values, then plain text will be fine. If your data has any amount of complexity, then use JSON. XML is a bit heavyweight and really should only be used if your application is large, but in that scenario you shouldn't be persisting to a file anyway.
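For instance, a minimal sketch of the JSON route using Gson (any JSON library would do; it assumes the Gson jar is on the classpath, and the Settings class is a made-up example):

import com.google.gson.Gson;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class JsonStore {

    // Made-up example of the state to persist.
    public static class Settings {
        int counter;
    }

    private static final Gson GSON = new Gson();

    public static void save(Settings settings, String path) throws IOException {
        Files.write(Paths.get(path), GSON.toJson(settings).getBytes(StandardCharsets.UTF_8));
    }

    public static Settings load(String path) throws IOException {
        String json = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        return GSON.fromJson(json, Settings.class);
    }
}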
It may be overkill for your situation, but you could use HSQLDB. You can configure it to persist to a file.
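A minimal sketch of HSQLDB in file mode; the database path and table are made up, and it assumes a reasonably recent hsqldb jar on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HsqldbExample {

    public static void main(String[] args) throws SQLException {
        // "file:" mode persists the database under the given path on disk.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hsqldb:file:data/appdb", "SA", "");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS counter (value INT)");
            st.execute("INSERT INTO counter VALUES (42)");
            try (ResultSet rs = st.executeQuery("SELECT value FROM counter")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("value"));
                }
            }
        }
    }
}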
For a simpler solution, you can always write to / read from a file. Some issues worth considering:
Use JNDI or a system variable to store the name and path of the file.
Make sure that the user that runs the server has read/write access to the file.
Other than that, you can use standard Java file operations.
You can use the Serializable interface in Java to create persistent objects that you can save to and reload from disk.
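To complement the save routine shown earlier, a sketch of reading the object back with ObjectInputStream (the caller casts the result to whatever type was written):

import java.io.FileInputStream;
import java.io.ObjectInputStream;

public class ObjectLoader {

    // Reads back an object previously written with ObjectOutputStream.
    public static Object loadThing(String fileName) throws Exception {
        try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(fileName))) {
            return ois.readObject();
        }
    }
}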
I'm working on a Java project and I have to read some files like these:
- EntryID.data
- EntryID.index
- KeyText.data
- KeyText.index
...
I think these files are used in a dictionary project, but I can't find any documentation about them. How can I read them or find out their format? Sorry for my English.
Thanks a lot!
These look like files from a database management system: one file to store the data, another to store at least one index to speed up queries.
I'd start with a hex editor and look at the files. Sometimes the binary content gives a hint.
Another idea: look at the classpath and inspect property and resource files. Maybe you'll find a database driver or some config files with JDBC connect strings.
Google told me that all four files are used by Apple's Dictionary.app. Have a look at this blog; it can point you in the right direction.
Last note: reading undocumented binaries is a challenge. I usually start with 010 Editor to analyse the data structure and develop a Java-based test tool to read the data. It's a sort of trial-and-error, evolutionary process.
Well, this is kind of difficult; .data could mean anything.
You could try the UNIX utility file, or open the file with a hex editor and look for interesting strings (the strings utility is helpful for that too).
Some information is in info.plist.
KeyText.data is sometimes compressed using zlib. 78 9C is a well-known zlib header, so you can decompress from the point where you find it. The size of the decompressed entry comes before the compressed entry.
Each entry in the array is preceded by its size.
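A rough sketch of decompressing one zlib block found at a given offset, based on the layout described above; the 4-byte little-endian size prefix is an assumption about this particular format, so adjust it after inspecting the file in a hex editor:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class KeyTextReader {

    // Decompresses one zlib block ("78 9C" header) located right after a size field.
    // The 4-byte little-endian size prefix is an assumption about the format.
    public static byte[] readEntry(String path, int offsetOfSizeField)
            throws IOException, DataFormatException {
        byte[] file = Files.readAllBytes(Paths.get(path));
        int decompressedSize = ByteBuffer.wrap(file, offsetOfSizeField, 4)
                .order(ByteOrder.LITTLE_ENDIAN).getInt();

        Inflater inflater = new Inflater();
        inflater.setInput(file, offsetOfSizeField + 4, file.length - offsetOfSizeField - 4);
        byte[] out = new byte[decompressedSize];
        int written = inflater.inflate(out);
        inflater.end();
        return Arrays.copyOf(out, written);
    }
}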
A C# library is available at https://github.com/kurema/MacDictionaryGeneral. But *.index is too difficult to understand and implement. info.plist says *.index is a trie index, which is not enough information to understand it fully.