BioNLP stanford - tokenization - java

I am trying to tokenize biomedical text, so I decided to use http://nlp.stanford.edu/software/eventparser.shtml. I used the stand-alone program RunBioNLPTokenizer, which does what I want.
Now I want to create my own program that uses the Stanford libraries, so I read the code of RunBioNLPTokenizer, shown below.
package edu.stanford.nlp.ie.machinereading.domains.bionlp;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.machinereading.GenericDataSetReader;
import edu.stanford.nlp.ie.machinereading.msteventextractor.DataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.EpigeneticsDataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.GENIA11DataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.InfectiousDiseasesDataSet;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.StringUtils;

/**
 * Standalone program to run our BioNLP tokenizer and save its output
 */
public class RunBioNLPTokenizer extends GenericDataSetReader {
    public static void main(String[] args) throws IOException {
        Properties props = StringUtils.argsToProperties(args);
        String basePath = props.getProperty("base.directory", "/u/nlp/data/bioNLP/2011/originals/");

        DataSet dataset = new GENIA11DataSet();
        dataset.getFilesystemInformation().setTokenizer("stanford");
        runTokenizerForDirectory(dataset, basePath + "genia/training");
        runTokenizerForDirectory(dataset, basePath + "genia/development");
        runTokenizerForDirectory(dataset, basePath + "genia/testing");

        dataset = new EpigeneticsDataSet();
        dataset.getFilesystemInformation().setTokenizer("stanford");
        runTokenizerForDirectory(dataset, basePath + "epi/training");
        runTokenizerForDirectory(dataset, basePath + "epi/development");
        runTokenizerForDirectory(dataset, basePath + "epi/testing");

        dataset = new InfectiousDiseasesDataSet();
        dataset.getFilesystemInformation().setTokenizer("stanford");
        runTokenizerForDirectory(dataset, basePath + "infect/training");
        runTokenizerForDirectory(dataset, basePath + "infect/development");
        runTokenizerForDirectory(dataset, basePath + "infect/testing");
    }

    private static void runTokenizerForDirectory(DataSet dataset, String path) throws IOException {
        System.out.println("Input directory: " + path);
        BioNLPFormatReader reader = new BioNLPFormatReader();
        for (File rawFile : reader.getRawFiles(path)) {
            System.out.println("Input filename: " + rawFile.getName());
            String rawText = IOUtils.slurpFile(rawFile);
            String docId = rawFile.getName().replace("." + BioNLPFormatReader.TEXT_EXTENSION, "");
            String parentPath = rawFile.getParent();
            runTokenizer(dataset.getFilesystemInformation().getTokenizedFilename(parentPath, docId), rawText);
        }
    }

    private static void runTokenizer(String tokenizedFilename, String text) {
        System.out.println("Tokenized filename: " + tokenizedFilename);
        Collection<String> sentences = BioNLPFormatReader.splitSentences(text);
        PrintStream os = null;
        try {
            os = new PrintStream(new FileOutputStream(tokenizedFilename));
        } catch (IOException e) {
            System.err.println("ERROR: cannot save online tokenization to " + tokenizedFilename);
            e.printStackTrace();
            System.exit(1);
        }
        for (String sentence : sentences) {
            BioNLPFormatReader.BioNLPTokenizer tokenizer = new BioNLPFormatReader.BioNLPTokenizer(sentence);
            List<CoreLabel> tokens = tokenizer.tokenize();
            for (CoreLabel l : tokens) {
                os.print(l.word() + " ");
            }
            os.println();
        }
        os.close();
    }
}
I wrote the code below. I managed to split the text into sentences, but I can't use BioNLPTokenizer the way it is used in RunBioNLPTokenizer.
public static void main(String[] args) throws Exception {
    // TODO code application logic here
    Collection<String> c = BioNLPFormatReader.splitSentences("..");
    for (String sentence : c) {
        System.out.println(sentence);
        BioNLPFormatReader.BioNLPTokenizer x = new BioNLPFormatReader.BioNLPTokenizer(sentence);
    }
}
I got this error:
Exception in thread "main" java.lang.RuntimeException: Uncompilable source code - edu.stanford.nlp.ie.machinereading.domains.bionlp.BioNLPFormatReader.BioNLPTokenizer has protected access in edu.stanford.nlp.ie.machinereading.domains.bionlp.BioNLPFormatReader
My question is: how can I tokenize a biomedical sentence with the Stanford libraries without using RunBioNLPTokenizer?

Unfortunately, we made BioNLPTokenizer a protected inner class, so you'd need to edit the source and change its access to public.
Note that BioNLPTokenizer may not be the most general-purpose biomedical sentence tokenizer -- I would spot-check the output to make sure it is reasonable. We developed it heavily against the BioNLP 2009/2011 shared tasks.
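If editing the Stanford source is not an option, reflection can reach a protected inner class from outside its package. Below is a self-contained sketch of that route using a stand-in `Outer`/`Inner` pair, since the Stanford jar is not assumed to be on the classpath here; the same pattern would apply to `BioNLPFormatReader$BioNLPTokenizer`.

```java
import java.lang.reflect.Constructor;

class Outer {
    // Stand-in for BioNLPFormatReader: the nested class is protected,
    // mirroring BioNLPFormatReader.BioNLPTokenizer.
    protected static class Inner {
        private final String text;
        protected Inner(String text) { this.text = text; }
        public String shout() { return text.toUpperCase(); }
    }
}

public class ReflectionAccessDemo {
    // Instantiate the protected inner class reflectively and call a method on it.
    static String viaReflection(String input) throws Exception {
        Class<?> innerClass = Class.forName("Outer$Inner");
        Constructor<?> ctor = innerClass.getDeclaredConstructor(String.class);
        ctor.setAccessible(true); // bypass the protected modifier
        Object instance = ctor.newInstance(input);
        return (String) innerClass.getMethod("shout").invoke(instance);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(viaReflection("p53 phosphorylation"));
    }
}
```

A simpler alternative, if you'd rather avoid reflection, is to place your own helper class in the `edu.stanford.nlp.ie.machinereading.domains.bionlp` package, from which the protected class is directly visible.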

Related

Java Exception not understood

I am writing search engine code in Java, and I'm getting this error without knowing the cause:
Exception in thread "main" java.lang.NullPointerException
at WriteToFile.fileWriter(WriteToFile.java:29)
at Main.main(Main.java:14)
Process finished with exit code 1
This is my code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;
public class Search {
    private static String URL = "https://www.google.com/search?q=";
    private Document doc;
    private Elements links;
    private String html;

    public Search() throws IOException {}

    public void SearchWeb() throws IOException {
        // to get the keywords from the user
        Scanner sc = new Scanner(System.in);
        System.out.println("Please enter the keyword you want to search for: ");
        String word = sc.nextLine();
        // Search for the keyword over the net
        String url = URL + word;
        doc = Jsoup.connect(url).get();
        html = doc.html();
        Files.write(Paths.get("D:\\OOP\\OOPproj\\data.txt"), html.getBytes());
        links = doc.select("cite");
    }

    public Document getDoc() {
        return doc;
    }

    public String getHtml() {
        return html;
    }

    public Elements getLinks() {
        return links;
    }
}
And this is the WriteToFile class:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
public class WriteToFile extends Search {
    public WriteToFile() throws IOException {}

    String description = "<!> Could not fetch description <!>";
    String keywords = "<!> Could not fetch keywords <!>";
    private ArrayList<String> detail = new ArrayList<String>();
    BufferedWriter bw = null;

    public void fileWriter() throws IOException {
        for (Element link : super.getLinks()) {
            String text = link.text();
            if (text.contains("›")) {
                text = text.replaceAll(" › ", "/");
            }
            detail.add(text);
            System.out.println(text);
        }
        System.out.println("***************************************************");
        for (int i = 0; i < detail.size(); i++)
            System.out.println("detail [" + (i + 1) + "]" + detail.get(i));
        System.out.println("###################################################################");
        for (int j = 0; j < detail.size(); j++) {
            Document document = Jsoup.connect(detail.get(j)).get();
            String web = document.html();
            Document d = Jsoup.parse(web);
            Elements metaTags = d.getElementsByTag("meta");
            for (Element metaTag : metaTags) {
                String content = metaTag.attr("content");
                String name = metaTag.attr("name");
                if ("description".equals(name)) {
                    description = content;
                }
                if ("keywords".equals(name)) {
                    keywords = content;
                }
            }
            String title = d.title();
            Files.write(Paths.get("D:\\OOP\\OOPproj\\search.txt"), (detail.get(j) + "\t" + "|" + "\t" + title + "\t" + "|" + "\t" + description + "\t" + "|" + "\t" + keywords + System.lineSeparator()).getBytes(), StandardOpenOption.APPEND);
        }
    }
}
This is the Main class:
import java.io.IOException;
public class Main {
    public static void main(String[] args) throws IOException {
        Search a = new Search();
        a.SearchWeb();
        WriteToFile b = new WriteToFile();
        b.fileWriter();
    }
}
I tried printing the result of getLinks() in main to check whether it was null, but it wasn't; the links were there.
I would be really grateful if someone could help me out.
You are calling SearchWeb() on object a, but you're calling fileWriter() on object b. This means the links are set in a, but not in b.
Since WriteToFile extends Search, you just need a single instance of that class:
WriteToFile a = new WriteToFile();
a.SearchWeb();
a.fileWriter();

How to use multiple CSV files in mapreduce

First, I will explain what I am trying to do. The first CSV file is the input to the MapReduce job, and the second CSV file is read inside the mapper class. The problem is that the code in the mapper class, shown below, does not work properly. I want to combine the two CSV files so I can use several columns from each.
For example, file 1 has BibNum (user account), checkoutdatetime (book checkout datetime), and itemtype (book itemtype), and file 2 has BibNum (user account), Title (book title), Itemtype, and so on. I want to find out which books are likely to be borrowed in the coming month. I would really appreciate it if you know a way to combine the two CSV files and could enlighten me. If you have any doubts about my code, just let me know and I will try to clarify it.
Path p = new Path("hdfs://0.0.0.0:8020/user/training/Inventory_Sample");
FileSystem fs = FileSystem.get(conf);
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(p)));
try {
    String BibNum = "Test";
    //System.out.print("test");
    while (br.readLine() != null) {
        //System.out.print("test");
        if (!br.readLine().startsWith("BibNumber")) {
            String subject[] = br.readLine().split(",");
            BibNum = subject[0];
        }
    }
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;
import java.util.HashMap;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class StubMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text outkey = new Text();
    //private MinMaxCountTuple outTuple = new MinMaxCountTuple();
    //String csvFile = "hdfs://user/training/Inventory_Sample";

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        //conf.addResource("/etc/hadoop/conf/core-site.xml");
        //conf.addResource("/etc/hadoop/conf/hdfs-site.xml");
        Path p = new Path("hdfs://0.0.0.0:8020/user/training/Inventory_Sample");
        FileSystem fs = FileSystem.get(conf);
        BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(p)));
        try {
            String BibNum = "Test";
            //System.out.print("test");
            while (br.readLine() != null) {
                //System.out.print("test");
                if (!br.readLine().startsWith("BibNumber")) {
                    String subject[] = br.readLine().split(",");
                    BibNum = subject[0];
                }
            }
            if (value.toString().startsWith("BibNumber")) {
                return;
            }
            String data[] = value.toString().split(",");
            String BookType = data[2];
            String DateTime = data[5];
            SimpleDateFormat frmt = new SimpleDateFormat("MM/dd/yyyy hh:mm:ss a");
            Date creationDate = frmt.parse(DateTime);
            frmt.applyPattern("dd-MM-yyyy");
            String dateTime = frmt.format(creationDate);
            //outkey.set(BookType + " " + dateTime);
            outkey.set(BibNum + " " + BookType + " " + dateTime);
            //outUserId.set(userId);
            context.write(outkey, new IntWritable(1));
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } finally {
            br.close();
        }
    }
}
You are reading the CSV file in the mapper code.
If you are opening a file by path in the mapper, you should be using the Distributed Cache; only then is the file shipped with the jar to each node where the map tasks run.
There is a way to combine the files, but not in the mapper.
You can try the below approach:
1) Write 2 separate mappers for the two different files.
2) Send only the required fields from the mappers to the reducer.
3) Combine the results in the reducer (as you want to join on some specific key).
You can check out MultipleInputs examples for more.
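The reduce-side join in steps 1-3 can be sketched without the Hadoop API to show the idea: each "mapper" tags its records with the source file, and the "reducer" joins the two sides per key. The column layout and sample rows below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceSideJoinSketch {
    // "Map" phase: tag each record with its source, keyed by BibNumber
    // (assumed to be column 0 in both files).
    static Map<String, List<String>> mapPhase(List<String> checkouts, List<String> inventory) {
        Map<String, List<String>> grouped = new HashMap<>();
        for (String line : checkouts) {
            String[] cols = line.split(",");
            grouped.computeIfAbsent(cols[0], k -> new ArrayList<>()).add("C:" + cols[1]);
        }
        for (String line : inventory) {
            String[] cols = line.split(",");
            grouped.computeIfAbsent(cols[0], k -> new ArrayList<>()).add("I:" + cols[1]);
        }
        return grouped;
    }

    // "Reduce" phase: for each key, pair every inventory value with every checkout value.
    static List<String> reducePhase(Map<String, List<String>> grouped) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            List<String> titles = new ArrayList<>();
            List<String> dates = new ArrayList<>();
            for (String v : e.getValue()) {
                if (v.startsWith("I:")) titles.add(v.substring(2));
                else dates.add(v.substring(2));
            }
            for (String t : titles)
                for (String d : dates)
                    joined.add(e.getKey() + "," + t + "," + d);
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> checkouts = List.of("123,01/02/2017", "123,05/02/2017");
        List<String> inventory = List.of("123,Some Title");
        System.out.println(reducePhase(mapPhase(checkouts, inventory)));
    }
}
```

In actual Hadoop code the two mappers would be registered via `MultipleInputs.addInputPath`, each emitting the tagged value under the join key, and the reducer would perform the pairing above.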

List attached devices on ubuntu in java

I'm a little stumped. Currently I am trying to list all of the attached devices on my system in Linux through a small Java app (similar to GParted) I'm working on. My end goal is to get the path to each device so I can format it in my application and perform other actions such as labeling and partitioning.
I currently have the following code, which returns the "system roots". On Windows this gets the appropriate drives (e.g. "C:/ D:/ ..."), but on Linux it returns "/", since that is its technical root. I was hoping to get the paths to the devices (e.g. "/dev/sda /dev/sdb ...") in an array.
What I'm using now
import java.io.File;
class ListAttachedDevices {
    public static void main(String[] args) {
        File[] paths;
        paths = File.listRoots();
        for (File path : paths) {
            System.out.println(path);
        }
    }
}
Any help or guidance would be much appreciated. I'm relatively new to SO and I hope this is enough information to cover everything.
Thank you in advance for any help/criticism!
EDIT:
Using part of Phillip's suggestion, I have updated my code to the following. The only problem I am having now is detecting whether the selected file is related to the Linux install (not safe to perform actions on) or is an attached drive (safe to perform actions on):
import java.io.File;
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.FileSystems;
import java.util.ArrayList;
import javax.swing.filechooser.FileSystemView;
class ListAttachedDevices {
    public static void main(String[] args) throws IOException {
        ArrayList<File> dev = new ArrayList<File>();
        for (FileStore store : FileSystems.getDefault().getFileStores()) {
            String text = store.toString();
            String match = "(";
            int position = text.indexOf(match);
            if (text.substring(position, position + 5).equals("(/dev")) {
                if (text.substring(position, position + 7).equals("(/dev/s")) {
                    String drivePath = text.substring(position + 1, text.length() - 1);
                    File drive = new File(drivePath);
                    dev.add(drive);
                    FileSystemView fsv = FileSystemView.getFileSystemView();
                    System.out.println("is (" + drive.getAbsolutePath() + ") root: " + fsv.isFileSystemRoot(drive));
                }
            }
        }
    }
}
EDIT 2:
Disregard the previous edit; I did not realize it failed to detect drives that are not already formatted.
Following Elliott Frisch's suggestion to use /proc/partitions, I've come up with the following answer. (Be warned: this also lists bootable/system drives.)
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.logging.Level;
import java.util.logging.Logger;
class ListAttachedDevices {
    public static void main(String[] args) throws IOException {
        ArrayList<File> drives = new ArrayList<File>();
        BufferedReader br = new BufferedReader(new FileReader("/proc/partitions"));
        try {
            StringBuilder sb = new StringBuilder();
            String line = br.readLine();
            while (line != null) {
                String text = line;
                String drivePath;
                if (text.contains("sd")) {
                    int position = text.indexOf("sd");
                    drivePath = "/dev/" + text.substring(position);
                    File drive = new File(drivePath);
                    drives.add(drive);
                    System.out.println(drive.getAbsolutePath());
                }
                line = br.readLine();
            }
        } catch (IOException e) {
            Logger.getLogger(ListAttachedDevices.class.getName()).log(Level.SEVERE, null, e);
        } finally {
            br.close();
        }
    }
}
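One caveat on the approach above: `text.contains("sd")` can match inside other words, and /proc/partitions rows have a fixed four-column layout (major, minor, #blocks, name), so splitting on whitespace and checking the name column is more robust. A small self-contained sketch, with made-up sample content standing in for the real file:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionParser {
    // Parse /proc/partitions-style content; each data row has four
    // whitespace-separated columns: major minor #blocks name.
    static List<String> devicePaths(String procPartitions) {
        List<String> paths = new ArrayList<>();
        for (String line : procPartitions.split("\n")) {
            String[] cols = line.trim().split("\\s+");
            // The header row ("major minor #blocks name") fails the startsWith check.
            if (cols.length == 4 && cols[3].startsWith("sd")) {
                paths.add("/dev/" + cols[3]);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        String sample = "major minor  #blocks  name\n\n"
                + "   8        0  488386584 sda\n"
                + "   8        1     524288 sda1\n"
                + "   8       16    7864320 sdb\n";
        System.out.println(devicePaths(sample));
    }
}
```

In the real program you would pass the contents of /proc/partitions to `devicePaths` instead of the sample string.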

Trying to copy files in specified path with specified extension and replace them with new extension

I have most of it down, but when I try to make the copy, no copy is made.
It finds the files in the specified directory like it is supposed to, and I think the copy function executes, but there aren't any new files in the specified directory. Any help is appreciated. I made a printf function that isn't shown here. Thanks!
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Scanner;
import org.apache.commons.io.FilenameUtils;
import org.apache.commons.io.FileUtils;
import static java.nio.file.StandardCopyOption.*;
public class Stuff {
    static String path, oldExtn, newExtn;
    static Boolean delOrig = false;

    private static void getPathStuff() {
        printf("Please enter the desired path\n");
        Scanner in = new Scanner(System.in);
        path = in.next();
        printf("Now enter the file extension to replace\n");
        oldExtn = in.next();
        printf("Now enter the file extension to replace with\n");
        newExtn = in.next();
        in.close();
    }

    public static void main(String[] args) {
        getPathStuff();
        File folder = new File(path);
        printf("folder = %s\n", folder.getPath());
        for (final File fileEntry : folder.listFiles()) {
            if (fileEntry.getName().endsWith(oldExtn)) {
                printf(fileEntry.getName() + "\n");
                File newFile = new File(FilenameUtils.getBaseName(fileEntry.getName() + newExtn));
                try {
                    printf("fileEntry = %s\n", fileEntry.toPath().toString());
                    Files.copy(fileEntry.toPath(), newFile.toPath(), REPLACE_EXISTING);
                } catch (IOException e) {
                    System.err.printf("Exception");
                }
            }
        }
    }
}
The problem is that the new file is created without a full path (only the file name), so your new file is created -- just not where you expect...
You can see that it'll work if you replace:
File newFile = new File(FilenameUtils.getBaseName(fileEntry.getName() + newExtn));
with:
File newFile = new File(fileEntry.getAbsolutePath()
        .substring(0, fileEntry.getAbsolutePath().lastIndexOf(".") + 1) + newExtn);
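An alternative sketch using java.nio's `Path`, which keeps the new file next to the original without manual index arithmetic (the path and extension below are examples):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtensionSwap {
    // Build the destination path next to the source file, swapping the extension.
    static Path withNewExtension(Path source, String newExtn) {
        String name = source.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = (dot == -1) ? name : name.substring(0, dot);
        // resolveSibling keeps the parent directory of the source.
        return source.resolveSibling(base + "." + newExtn);
    }

    public static void main(String[] args) {
        System.out.println(withNewExtension(Paths.get("/tmp/data/report.txt"), "bak"));
        // -> /tmp/data/report.bak
    }
}
```

You could then call `Files.copy(fileEntry.toPath(), withNewExtension(fileEntry.toPath(), newExtn), REPLACE_EXISTING)` in the loop above.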

display console result on java GUI for each class (java newbie programmer)

Guys, I have a simple question:
is it possible to display each class's console output in a Java GUI?
Each class produces different console output.
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
public class SimpleWebCrawler {
    public static void main(String[] args) throws IOException {
        try {
            URL my_url = new URL("http://theworldaccordingtothisgirl.blogspot.com/");
            BufferedReader br = new BufferedReader(new InputStreamReader(my_url.openStream()));
            String strTemp = "";
            while (null != (strTemp = br.readLine())) {
                System.out.println(strTemp);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        System.out.println("\n");
        System.out.println("\n");
        System.out.println("\n");
        Validate.isTrue(args.length == 0, "usage: supply url to crawl");
        String url = "http://theworldaccordingtothisgirl.blogspot.com/";
        print("Fetching %s...", url);
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        System.out.println("\n");
        BufferedWriter bw = new BufferedWriter(new FileWriter("abc.txt"));
        for (Element link : links) {
            print(" %s ", link.attr("abs:href"), trim(link.text(), 35));
            bw.write(link.attr("abs:href"));
            bw.write(System.getProperty("line.separator"));
        }
        bw.flush();
        bw.close();
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }
}
Example output :
Fetching http://theworldaccordingtothisgirl.blogspot.com/...
http://theworldaccordingtothisgirl.blogspot.com/2011/03/in-time-like-this.html
https://lh5.googleusercontent.com/-yz2ql0o45Aw/TYBNhyFVpMI/AAAAAAAAAGU/OrPZrBjwWi8/s1600/Malaysian-Newspaper-Apologises-For-Tsunami-Cartoon.jpg
http://ireport.cnn.com/docs/DOC-571892
https://lh3.googleusercontent.com/-nXOxDT4ZyWA/TX-HaKoHE3I/AAAAAAAAAGQ/xwXJ-8hNt1M/s1600/ScrnShotsDesktop-1213678160_large.png
http://theworldaccordingtothisgirl.blogspot.com/2011/03/in-time-like-this.html#comments
http://www.blogger.com/share-post.g?blogID=3284083343891767749&postID=785884436807581777&target=email
http://www.blogger.com/share-post.g?blogID=3284083343891767749&postID=785884436807581777&target=blog
http://www.blogger.com/share-post.g?blogID=3284083343891767749&postID=785884436807581777&target=twitter
http://www.blogger.com/share-post.g?blogID=3284083343891767749&postID=785884436807581777&target=facebook
http://www.blogger.com/share-post.g?blogID=3284083343891767749&postID=785884436807581777&target=buzz
If you want to separate the standard output (System.out.println) based on which class it comes from, the answer is no (not easily, at least).
I suggest you let each class that wants to do output take a PrintWriter as a constructor argument, then use that PrintWriter instead of System.out.
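A minimal sketch of that suggestion: the class below writes to whatever PrintWriter it is handed, so a GUI could pass one backed by a JTextArea, while a StringWriter stands in for it here (the Crawler class and its output line are illustrative, not from the question's code).

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class InjectedOutputDemo {
    // A class that writes to an injected PrintWriter instead of
    // hard-coding System.out, so each instance's output can be routed separately.
    static class Crawler {
        private final PrintWriter out;
        Crawler(PrintWriter out) { this.out = out; }
        void report(String url) { out.printf("Fetching %s...%n", url); }
    }

    public static void main(String[] args) {
        // In a GUI you would pass a PrintWriter that appends to a JTextArea;
        // here a StringWriter captures the output instead.
        StringWriter buffer = new StringWriter();
        Crawler c = new Crawler(new PrintWriter(buffer, true));
        c.report("http://example.com/");
        System.out.print(buffer);
    }
}
```

Each class (SimpleWebCrawler, etc.) would get its own PrintWriter, and the GUI decides where each one's text ends up.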
