Not entering the Tesseract method doOCR(File imageFile)

I have created a small console application that performs OCR on a .tiff image file using tess4j.
public class JavaApplication10 {
/**
* @param args the command line arguments
*/
public static void main(String[] args)
{
File imageFile = new File("C:\\Users\\Manesh\\Desktop\\license_plate.tiff");
Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
// Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping
try
{
String result = instance.doOCR(imageFile); //Empty result
System.out.println("hahahaha");
System.out.println("The result is: " + result);
}
catch (TesseractException e)
{
System.out.println("error:" + e);
}
}
}
I'm not getting any value in result. When I looked into the code of the Tesseract class and inserted a couple of System.out.println calls, those are not printed to the console either. My Tesseract code is given below.
public class Tesseract
{
private static Tesseract instance;
private final static Rectangle EMPTY_RECTANGLE = new Rectangle();
private String language = "eng";
private String datapath = "tessdata";
private int psm = TessAPI.TessPageSegMode.PSM_AUTO;
private boolean hocr;
private int pageNum;
private int ocrEngineMode = TessAPI.TessOcrEngineMode.OEM_DEFAULT;
private Properties prop = new Properties();
public final static String htmlBeginTag =
"<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\""
+ " \"http://www.w3.org/TR/html4/loose.dtd\">\n"
+ "<html>\n<head>\n<title></title>\n"
+ "<meta http-equiv=\"Content-Type\" content=\"text/html;"
+ "charset=utf-8\" />\n<meta name='ocr-system' content='tesseract'/>\n"
+ "</head>\n<body>\n";
public final static String htmlEndTag = "</body>\n</html>\n";
private Tesseract()
{
System.setProperty("jna.encoding", "UTF8");
}
public static synchronized Tesseract getInstance()
{
if (instance == null)
{
instance = new Tesseract();
}
return instance;
}
public void setDatapath(String datapath)
{
this.datapath = datapath;
}
public void setLanguage(String language)
{
this.language = language;
}
public void setOcrEngineMode(int ocrEngineMode)
{
this.ocrEngineMode = ocrEngineMode;
}
public void setPageSegMode(int mode)
{
this.psm = mode;
}
public void setHocr(boolean hocr)
{
this.hocr = hocr;
prop.setProperty("tessedit_create_hocr", hocr ? "1" : "0");
}
public void setTessVariable(String key, String value)
{
prop.setProperty(key, value);
}
public String doOCR(File imageFile) throws TesseractException
{
System.out.println("hiiiiiii "); //not getting printed
return doOCR(imageFile, null);
}
public String doOCR(File imageFile, Rectangle rect) throws TesseractException
{
try
{
System.out.println("be: "); //not getting printed
return doOCR(ImageIOHelper.getIIOImageList(imageFile), rect);
}
catch (IOException ioe)
{
throw new TesseractException(ioe);
}
}
public String doOCR(BufferedImage bi) throws TesseractException
{
return doOCR(bi, null);
}
public String doOCR(BufferedImage bi, Rectangle rect) throws TesseractException
{
IIOImage oimage = new IIOImage(bi, null, null);
List<IIOImage> imageList = new ArrayList<IIOImage>();
imageList.add(oimage);
return doOCR(imageList, rect);
}
public String doOCR(List<IIOImage> imageList, Rectangle rect) throws TesseractException
{
StringBuilder sb = new StringBuilder();
pageNum = 0;
for (IIOImage oimage : imageList)
{
pageNum++;
try
{
ByteBuffer buf = ImageIOHelper.getImageByteBuffer(oimage);
RenderedImage ri = oimage.getRenderedImage();
String pageText = doOCR(ri.getWidth(), ri.getHeight(), buf, rect, ri.getColorModel().getPixelSize());
sb.append(pageText);
}
catch (IOException ioe)
{
//skip the problematic image
System.err.println(ioe.getMessage());
}
}
if (hocr)
{
sb.insert(0, htmlBeginTag).append(htmlEndTag);
}
return sb.toString();
}
public String doOCR(int xsize, int ysize, ByteBuffer buf, Rectangle rect, int bpp) throws TesseractException
{
TessAPI api = TessAPI.INSTANCE;
TessAPI.TessBaseAPI handle = api.TessBaseAPICreate();
api.TessBaseAPIInit2(handle, datapath, language, ocrEngineMode);
api.TessBaseAPISetPageSegMode(handle, psm);
Enumeration em = prop.propertyNames();
while (em.hasMoreElements())
{
String key = (String) em.nextElement();
api.TessBaseAPISetVariable(handle, key, prop.getProperty(key));
}
int bytespp = bpp / 8;
int bytespl = (int) Math.ceil(xsize * bpp / 8.0);
api.TessBaseAPISetImage(handle, buf, xsize, ysize, bytespp, bytespl);
if (rect != null && !rect.equals(EMPTY_RECTANGLE))
{
api.TessBaseAPISetRectangle(handle, rect.x, rect.y, rect.width, rect.height);
}
Pointer utf8Text = hocr ? api.TessBaseAPIGetHOCRText(handle, pageNum - 1) : api.TessBaseAPIGetUTF8Text(handle);
String str = utf8Text.getString(0);
api.TessDeleteText(utf8Text);
api.TessBaseAPIDelete(handle);
return str;
}
}
I'm using Tesseract for the first time. Please tell me what I'm doing wrong.
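One thing that may be worth checking when print statements added to a library class never appear: the JVM might be loading the compiled Tesseract class from the tess4j jar instead of the edited source. This is only an assumption about the setup described here. A minimal sketch that reports where a class was loaded from (WhichClass and locationOf are hypothetical names introduced for illustration):

```java
import java.security.CodeSource;

final class WhichClass {
    /** Returns the jar or directory a class was loaded from, or a note if none is recorded. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null) {
            // Classes loaded by the bootstrap loader usually have no code source.
            return "(no code source recorded)";
        }
        return String.valueOf(source.getLocation());
    }

    public static void main(String[] args) {
        // In the OCR program, Tesseract.class would be passed here instead.
        System.out.println(locationOf(WhichClass.class));
    }
}
```

If the printed location is the tess4j jar rather than the project's own build output, the edited source is not the class being executed.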

Tesseract needs to be given exactly the region of the image you want to run OCR on. For example, suppose you are reading the chest numbers of players: if you pass only the cropped, gray-scaled image of the chest number, it will read the text, whereas if you pass the whole image, it will not. You can restrict OCR to a region using:
String doOCR(BufferedImage img, Rectangle rect);
I'm passing the cropped image directly, so I'm not using the above method. My code looks like this right now:
public class JavaApplication10 {
/**
* @param args the command line arguments
*/
public static void main(String[] args)
{
try
{
File imageFile = new File("C:\\Users\\Manesh\\Desktop\\116.jpg"); //This is a cropped image of a chest number.
BufferedImage img = ImageIO.read(imageFile);
//BufferedImageOp grayscaleConv = new ColorConvertOp(colorFrame.getColorModel().getColorSpace(), grayscaleConv.filter(colorFrame, grayFrame);
Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
ColorSpace cs = ColorSpace.getInstance(ColorSpace.CS_GRAY);
ColorConvertOp op = new ColorConvertOp(cs, null);
op.filter(img, img); // gray scaling the image
// Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping
try
{
String result = instance.doOCR(img);
System.out.println("hahahaha");
System.out.println("The result is: " + result);
}
catch (TesseractException e)
{
System.out.println("error:" + e);
}
}
catch (IOException ex)
{
Logger.getLogger(JavaApplication10.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
This is what I found. Please feel free to correct me if I'm wrong anywhere.
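The crop-then-gray-scale step described above can also be done in code before the image is handed to doOCR(BufferedImage). The following is a minimal sketch using only java.awt classes; OcrPreprocess and cropToGray are hypothetical names, and it draws onto a separate TYPE_BYTE_GRAY image instead of filtering the image in place, which is a swapped-in technique and not what the code above does.

```java
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

final class OcrPreprocess {
    /** Crops src to the given region and returns an independent 8-bit grayscale copy. */
    static BufferedImage cropToGray(BufferedImage src, Rectangle region) {
        // Clamp the region to the image bounds so getSubimage cannot throw.
        Rectangle clipped = region.intersection(
                new Rectangle(0, 0, src.getWidth(), src.getHeight()));
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("region lies outside the image");
        }
        BufferedImage cropped =
                src.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);
        // Drawing onto a gray image converts the pixels and detaches them from src.
        BufferedImage gray = new BufferedImage(
                clipped.width, clipped.height, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(cropped, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }
}
```

The returned image could then be passed to instance.doOCR(...) in place of img. Whether this improves recognition for a particular picture is not something the sketch can guarantee.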

How to replace a figure with a placeholder or a certain image in a Word document using Apache POI?

Let's assume I have a Word document with this body.
Word document before replacing images
private void findImages(XWPFParagraph p) {
for (XWPFRun r : p.getRuns()) {
for (XWPFPicture pic : r.getEmbeddedPictures()) {
XWPFPicture picture = pic;
XWPFPictureData source = picture.getPictureData();
BufferedImage qrCodeImage = printVersionService.generateQRCodeImage("JASAW EMA WWS");
File imageFile = new File("image.jpg");
try {
ImageIO.write(qrCodeImage, "jpg", imageFile);
} catch (IOException e) {
e.printStackTrace();
}
try ( FileInputStream in = new FileInputStream(imageFile);
OutputStream out = source.getPackagePart().getOutputStream();
) {
byte[] buffer = new byte[2048];
int length;
while ((length = in.read(buffer)) > 0) {
out.write(buffer, 0, length);
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
}
This code replaces every image with the QR code.
But I have one problem.
Word Document after replacing
So my question is:
How can I replace only the image I chose, or how can I replace an inserted figure that contains text with an image generated by my own function?

Detecting the picture and replacing the picture data will be the simplest approach. In the following answer I have shown how to detect and replace pictures by name: Java Apache POI: insert an image "infront the text". If you do not know the name of the embedded picture, a picture can also be detected by its alt text. To edit the alt text of a picture, open the context menu by right-clicking the picture and choose Edit A̲lt Text from that context menu.
In How to read alt text of image in word document apache.poi I have already shown how to read the alt text of an image.
So the code could look like:
import java.io.FileInputStream;
import java.io.OutputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
public class WordReplacePictureData {
static org.apache.xmlbeans.XmlObject getInlineOrAnchor(org.openxmlformats.schemas.drawingml.x2006.picture.CTPicture ctPictureToFind, org.apache.xmlbeans.XmlObject inlineOrAnchor) {
String declareNameSpaces = "declare namespace pic='http://schemas.openxmlformats.org/drawingml/2006/picture'; ";
org.apache.xmlbeans.XmlObject[] selectedObjects = inlineOrAnchor.selectPath(
declareNameSpaces
+ "$this//pic:pic");
for (org.apache.xmlbeans.XmlObject selectedObject : selectedObjects) {
if (selectedObject instanceof org.openxmlformats.schemas.drawingml.x2006.picture.CTPicture) {
org.openxmlformats.schemas.drawingml.x2006.picture.CTPicture ctPicture = (org.openxmlformats.schemas.drawingml.x2006.picture.CTPicture)selectedObject;
if (ctPictureToFind.equals(ctPicture)) {
// this is the inlineOrAnchor for that picture
return inlineOrAnchor;
}
}
}
return null;
}
static org.apache.xmlbeans.XmlObject getInlineOrAnchor(XWPFRun run, XWPFPicture picture) {
org.openxmlformats.schemas.drawingml.x2006.picture.CTPicture ctPictureToFind = picture.getCTPicture();
for (org.openxmlformats.schemas.wordprocessingml.x2006.main.CTDrawing drawing : run.getCTR().getDrawingList()) {
for (org.openxmlformats.schemas.drawingml.x2006.wordprocessingDrawing.CTInline inline : drawing.getInlineList()) {
org.apache.xmlbeans.XmlObject inlineOrAnchor = getInlineOrAnchor(ctPictureToFind, inline);
// if inlineOrAnchor is not null, then this is the inline for that picture
if (inlineOrAnchor != null) return inlineOrAnchor;
}
for (org.openxmlformats.schemas.drawingml.x2006.wordprocessingDrawing.CTAnchor anchor : drawing.getAnchorList()) {
org.apache.xmlbeans.XmlObject inlineOrAnchor = getInlineOrAnchor(ctPictureToFind, anchor);
// if inlineOrAnchor is not null, then this is the anchor for that picture
if (inlineOrAnchor != null) return inlineOrAnchor;
}
}
return null;
}
static org.openxmlformats.schemas.drawingml.x2006.main.CTNonVisualDrawingProps getNonVisualDrawingProps(org.apache.xmlbeans.XmlObject inlineOrAnchor) {
if (inlineOrAnchor == null) return null;
if (inlineOrAnchor instanceof org.openxmlformats.schemas.drawingml.x2006.wordprocessingDrawing.CTInline) {
org.openxmlformats.schemas.drawingml.x2006.wordprocessingDrawing.CTInline inline = (org.openxmlformats.schemas.drawingml.x2006.wordprocessingDrawing.CTInline)inlineOrAnchor;
return inline.getDocPr();
} else if (inlineOrAnchor instanceof org.openxmlformats.schemas.drawingml.x2006.wordprocessingDrawing.CTAnchor) {
org.openxmlformats.schemas.drawingml.x2006.wordprocessingDrawing.CTAnchor anchor = (org.openxmlformats.schemas.drawingml.x2006.wordprocessingDrawing.CTAnchor)inlineOrAnchor;
return anchor.getDocPr();
}
return null;
}
static String getSummary(org.openxmlformats.schemas.drawingml.x2006.main.CTNonVisualDrawingProps nonVisualDrawingProps) {
if (nonVisualDrawingProps == null) return "";
String summary = "Id:=" + nonVisualDrawingProps.getId();
summary += " Name:=" + nonVisualDrawingProps.getName();
summary += " Title:=" + nonVisualDrawingProps.getTitle();
summary += " Descr:=" + nonVisualDrawingProps.getDescr();
return summary;
}
static XWPFPicture getPictureByAltText(XWPFRun run, String altText) {
if (altText == null) return null;
for (XWPFPicture picture : run.getEmbeddedPictures()) {
String altTextSummary = getSummary(getNonVisualDrawingProps(getInlineOrAnchor(run, picture)));
System.out.println(altTextSummary);
if (altTextSummary.contains(altText)) {
return picture;
}
}
return null;
}
static void replacePictureData(XWPFPictureData source, String pictureResultPath) {
try ( FileInputStream in = new FileInputStream(pictureResultPath);
OutputStream out = source.getPackagePart().getOutputStream();
) {
byte[] buffer = new byte[2048];
int length;
while ((length = in.read(buffer)) > 0) {
out.write(buffer, 0, length);
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
static void replacePicture(XWPFRun run, String altText, String pictureResultPath) {
XWPFPicture picture = getPictureByAltText(run, altText);
if (picture != null) {
XWPFPictureData source = picture.getPictureData();
replacePictureData(source, pictureResultPath);
}
}
public static void main(String[] args) throws Exception {
String templatePath = "./source.docx";
String resultPath = "./result.docx";
String altText = "Placeholder QR-Code";
String pictureResultPath = "./QR.jpg";
try ( XWPFDocument document = new XWPFDocument(new FileInputStream(templatePath));
FileOutputStream out = new FileOutputStream(resultPath);
) {
for (IBodyElement bodyElement : document.getBodyElements()) {
if (bodyElement instanceof XWPFParagraph) {
XWPFParagraph paragraph = (XWPFParagraph)bodyElement;
for (XWPFRun run : paragraph.getRuns()) {
replacePicture(run, altText, pictureResultPath);
}
}
}
document.write(out);
}
}
}
This replaces the picture or pictures having the alt text "Placeholder QR-Code". All other pictures remain as they are.
Replacing shapes with pictures is very laborious because shapes are stored in alternate content elements (a choice holding the shape, plus a fallback), so both the shape and the fallback need to be changed. If the fallback were left untouched, applications that rely on it would continue to show the old shape. Furthermore, detecting shapes by text box content is not really much simpler than detecting pictures by alt text content.

How to add the remaining batch of n elements to an ArrayList?

I'm currently learning to develop a simple blockchain program that reads sample data from a .txt file and creates a new block for every 10 transactions. If the given sample data has 23 lines of transactions, is there a way to make a new block that consists of the last 3 transactions?
Current Output
Block[header=Header[index=0,currHash=51aa6b7cf5fb821189d58b5c995b4308370888efcaac469d79ad0a5d94fb0432, prevHash=0, timestamp=1654785847112], tranx=null]
Block[header=Header[index=0,currHash=92b3582095e2403c68401448e8a34864e8465d0ea51c05f11c23810ec36b4868, prevHash=0, timestamp=1654785847385], tranx=Transaction [tranxLst=[alice|bob|credit|1.0, alice|bob|debit|2.0, alice|bob|debit|3.0, alice|bob|credit|4.0, alice|bob|debit|5.0, alice|bob|credit|6.0, alice|bob|debit|7.0, alice|bob|debit|8.0, alice|bob|debit|9.0, alice|bob|debit|10.0]]]
Block[header=Header[index=0,currHash=7488c600433d78e0fb8586e71a010b1d39a040cb101cc6e3418668d21b614519, prevHash=0, timestamp=1654785847386], tranx=Transaction [tranxLst=[alice|bob|credit|11.0, alice|bob|credit|12.0, alice|bob|debit|13.0, alice|bob|debit|14.0, alice|bob|credit|15.0, alice|bob|credit|16.0, alice|bob|credit|17.0, alice|bob|debit|18.0, alice|bob|credit|19.0, alice|bob|credit|20.0]]]
What I want
Block[header=Header[index=0,currHash=51aa6b7cf5fb821189d58b5c995b4308370888efcaac469d79ad0a5d94fb0432, prevHash=0, timestamp=1654785847112], tranx=null]
Block[header=Header[index=0,currHash=92b3582095e2403c68401448e8a34864e8465d0ea51c05f11c23810ec36b4868, prevHash=0, timestamp=1654785847385], tranx=Transaction [tranxLst=[alice|bob|credit|1.0, alice|bob|debit|2.0, alice|bob|debit|3.0, alice|bob|credit|4.0, alice|bob|debit|5.0, alice|bob|credit|6.0, alice|bob|debit|7.0, alice|bob|debit|8.0, alice|bob|debit|9.0, alice|bob|debit|10.0]]]
Block[header=Header[index=0,currHash=7488c600433d78e0fb8586e71a010b1d39a040cb101cc6e3418668d21b614519, prevHash=0, timestamp=1654785847386], tranx=Transaction [tranxLst=[alice|bob|credit|11.0, alice|bob|credit|12.0, alice|bob|debit|13.0, alice|bob|debit|14.0, alice|bob|credit|15.0, alice|bob|credit|16.0, alice|bob|credit|17.0, alice|bob|debit|18.0, alice|bob|credit|19.0, alice|bob|credit|20.0]]]
Block[header=Header[index=0,currHash=7488c600433d78e0fb8586e71a010b1d39a040cb101cc6e3418668d21b614520, prevHash=0, timestamp=1654785847387], tranx=Transaction [tranxLst=[alice|bob|credit|21.0, alice|bob|credit|22.0, alice|bob|debit|23.0]]]
my code:
Client app
public static void main(String[] args) throws IOException {
homework();
}
static void homework() throws IOException {
int count = 0;
Transaction tranxLst = new Transaction();
Block genesis = new Block("0");
System.out.println(genesis);
BufferedReader bf = new BufferedReader(new FileReader("dummytranx.txt"));
String line = bf.readLine();
while (line != null) {
tranxLst.add(line);
line = bf.readLine();
count++;
if (count % 10 == 0) {
Block newBlock = new Block(genesis.getHeader().getPrevHash());
newBlock.setTranx(tranxLst);
System.out.println(newBlock);
tranxLst.getTranxLst().clear();
}
}
bf.close();
}
Transaction class
public class Transaction implements Serializable {
public static final int SIZE = 10;
/**
* we will comeback to generate the merkle root ie., hash of merkle tree
* merkleRoot = hash
*/
private String merkleRoot = "9a0885f8cd8d94a57cd76150a9c4fa8a4fed2d04c244f259041d8166cdfeca1b8c237b2c4bca57e87acb52c8fa0777da";
// private String merkleRoot;
public String getMerkleRoot() {
return merkleRoot;
}
public void setMerkleRoot(String merkleRoot) {
this.merkleRoot = merkleRoot;
}
/**
* For the data collection, u may want to choose classic array or collection api
*/
private List<String> tranxLst;
public List<String> getTranxLst() {
return tranxLst;
}
public Transaction() {
tranxLst = new ArrayList<>(SIZE);
}
/**
* add()
*/
public void add(String tranx) {
tranxLst.add(tranx);
}
@Override
public String toString() {
return "Transaction [tranxLst=" + tranxLst + "]";
}
}
Block class
public class Block implements Serializable {
private Header header;
public Header getHeader() {
return header;
}
private Transaction tranx;
public Block(String previousHash) {
header = new Header();
header.setTimestamp(new Timestamp(System.currentTimeMillis()).getTime());
header.setPrevHash(previousHash);
String blockHash = Hasher.sha256(getBytes());
header.setCurrHash(blockHash);
}
/**
* getBytes of the Block object
*/
private byte[] getBytes() {
try (ByteArrayOutputStream baos = new ByteArrayOutputStream();
ObjectOutputStream out = new ObjectOutputStream(baos);) {
out.writeObject(this);
return baos.toByteArray();
} catch (Exception e) {
e.printStackTrace();
return null;
}
}
public Transaction getTranx() {
return tranx;
}
/**
* aggregation rel
*/
public void setTranx(Transaction tranx) {
this.tranx = tranx;
}
/**
* composition rel
*/
public class Header implements Serializable {
private int index;
private String currHash, prevHash;
private long timestamp;
// getset methods
public String getCurrHash() {
return currHash;
}
public int getIndex() {
return index;
}
public void setIndex(int index) {
this.index = index;
}
public void setCurrHash(String currHash) {
this.currHash = currHash;
}
public String getPrevHash() {
return prevHash;
}
public void setPrevHash(String prevHash) {
this.prevHash = prevHash;
}
public long getTimestamp() {
return timestamp;
}
public void setTimestamp(long timestamp) {
this.timestamp = timestamp;
}
@Override
public String toString() {
return "Header [index=" + index + ", currHash=" + currHash + ", prevHash=" + prevHash + ", timestamp="
+ timestamp + "]";
}
}
@Override
public String toString() {
return "Block [header=" + header + ", tranx=" + tranx + "]";
}
}

Instead of using a counter in the conditional statement, try an inner for loop that reads up to 10 lines per block.
static void homework() throws IOException {
    Transaction tranxLst = new Transaction();
    Block genesis = new Block("0");
    System.out.println(genesis);
    BufferedReader bf = new BufferedReader(new FileReader("dummytranx.txt"));
    String line = bf.readLine();
    while (line != null) {
        // Collect at most 10 transactions; stop early at end of file.
        for (int i = 0; i < 10; i++) {
            tranxLst.add(line);
            line = bf.readLine();
            if (line == null) {
                break;
            }
        }
        // The last block may hold fewer than 10 transactions.
        Block newBlock = new Block(genesis.getHeader().getPrevHash());
        newBlock.setTranx(tranxLst);
        System.out.println(newBlock);
        tranxLst.getTranxLst().clear();
    }
    bf.close();
}
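Another way to get the same result is to keep the counter-style loop and flush whatever is left once the input ends. The sketch below shows that idea on plain lists of strings, without the Block and Transaction classes; BatchSketch and batches are hypothetical names. It also starts a fresh list for every batch instead of clearing a shared one, so earlier batches keep their contents if they are stored somewhere.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchSketch {
    /** Splits lines into consecutive batches of at most size elements. */
    static List<List<String>> batches(List<String> lines, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        List<String> current = new ArrayList<>(size);
        for (String line : lines) {
            current.add(line);
            if (current.size() == size) {
                result.add(current);
                // Start a fresh list so the stored batch is not emptied later.
                current = new ArrayList<>(size);
            }
        }
        // Flush the remainder, for example the last 3 of 23 lines.
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }
}
```

With 23 input lines and a size of 10, this yields three batches holding 10, 10 and 3 lines. In homework(), each batch would become the transaction list of one new block.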

Apache HBase MapReduce job takes too much time while reading the datastore

I have set up an Apache HBase, Nutch and Hadoop cluster. I have crawled about 30 million documents. There are 3 workers and 1 master in the cluster. I have written my own HBase MapReduce job to read the crawled data and adjust some scores slightly based on some logic.
For this purpose, I have grouped the documents of the same domain, computed their effective bytes and derived a score. Later, in the reducer, I assign that score to each URL of that domain (via a cache). This portion of the job takes too much time, i.e., 16 hours. Following is the code snippet:
for ( int index = 0; index < Cache.size(); index++) {
String Orig_key = Cache.get(index);
float doc_score = log10;
WebPage page = datastore.get(Orig_key);
if ( page == null ) {
continue;
}
page.setScore(doc_score);
if (mark) {
page.getMarkers().put( Queue, Q1);
}
context.write(Orig_key, page);
}
If I remove the statement that reads the document from the datastore, the job finishes in only 2 to 3 hours. That is why I think the statement WebPage page = datastore.get(Orig_key); is causing this problem. Isn't it?
If that is the case, what is the best approach? The Cache object is simply a list that contains the URLs of the same domain.
DomainAnalysisJob.java
...
...
public class DomainAnalysisJob implements Tool {
public static final Logger LOG = LoggerFactory
.getLogger(DomainAnalysisJob.class);
private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
private Configuration conf;
protected static final Utf8 URL_ORIG_KEY = new Utf8("doc_orig_id");
protected static final Utf8 DOC_DUMMY_MARKER = new Utf8("doc_marker");
protected static final Utf8 DUMMY_KEY = new Utf8("doc_id");
protected static final Utf8 DOMAIN_DUMMY_MARKER = new Utf8("domain_marker");
protected static final Utf8 LINK_MARKER = new Utf8("link");
protected static final Utf8 Queue = new Utf8("q");
private static URLNormalizers urlNormalizers;
private static URLFilters filters;
private static int maxURL_Length;
static {
FIELDS.add(WebPage.Field.STATUS);
FIELDS.add(WebPage.Field.LANG_INFO);
FIELDS.add(WebPage.Field.URDU_SCORE);
FIELDS.add(WebPage.Field.MARKERS);
FIELDS.add(WebPage.Field.INLINKS);
}
/**
* Maps each WebPage to a host key.
*/
public static class Mapper extends GoraMapper<String, WebPage, Text, WebPage> {
@Override
protected void setup(Context context) throws IOException ,InterruptedException {
Configuration conf = context.getConfiguration();
urlNormalizers = new URLNormalizers(context.getConfiguration(), URLNormalizers.SCOPE_DEFAULT);
filters = new URLFilters(context.getConfiguration());
maxURL_Length = conf.getInt("url.characters.max.length", 2000);
}
@Override
protected void map(String key, WebPage page, Context context)
throws IOException, InterruptedException {
String reversedHost = null;
if (page == null) {
return;
}
if ( key.length() > maxURL_Length ) {
return;
}
String url = null;
try {
url = TableUtil.unreverseUrl(key);
url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT);
url = filters.filter(url); // filter the url
} catch (Exception e) {
LOG.warn("Skipping " + key + ":" + e);
return;
}
if ( url == null) {
context.getCounter("DomainAnalysis", "FilteredURL").increment(1);
return;
}
try {
reversedHost = TableUtil.getReversedHost(key.toString());
}
catch (Exception e) {
return;
}
page.getMarkers().put( URL_ORIG_KEY, new Utf8(key) );
context.write( new Text(reversedHost), page );
}
}
public DomainAnalysisJob() {
}
public DomainAnalysisJob(Configuration conf) {
setConf(conf);
}
@Override
public Configuration getConf() {
return conf;
}
@Override
public void setConf(Configuration conf) {
this.conf = conf;
}
public void updateDomains(boolean buildLinkDb, int numTasks) throws Exception {
NutchJob job = NutchJob.getInstance(getConf(), "rankDomain-update");
job.getConfiguration().setInt("mapreduce.task.timeout", 1800000);
if ( numTasks < 1) {
job.setNumReduceTasks(job.getConfiguration().getInt(
"mapred.map.tasks", job.getNumReduceTasks()));
} else {
job.setNumReduceTasks(numTasks);
}
ScoringFilters scoringFilters = new ScoringFilters(getConf());
HashSet<WebPage.Field> fields = new HashSet<WebPage.Field>(FIELDS);
fields.addAll(scoringFilters.getFields());
StorageUtils.initMapperJob(job, fields, Text.class, WebPage.class,
Mapper.class);
StorageUtils.initReducerJob(job, DomainAnalysisReducer.class);
job.waitForCompletion(true);
}
@Override
public int run(String[] args) throws Exception {
boolean linkDb = false;
int numTasks = -1;
for (int i = 0; i < args.length; i++) {
if ("-rankDomain".equals(args[i])) {
linkDb = true;
} else if ("-crawlId".equals(args[i])) {
getConf().set(Nutch.CRAWL_ID_KEY, args[++i]);
} else if ("-numTasks".equals(args[i]) ) {
numTasks = Integer.parseInt(args[++i]);
}
else {
throw new IllegalArgumentException("unrecognized arg " + args[i]
+ " usage: updatedomain -crawlId <crawlId> [-numTasks N]" );
}
}
LOG.info("Updating DomainRank:");
updateDomains(linkDb, numTasks);
return 0;
}
public static void main(String[] args) throws Exception {
final int res = ToolRunner.run(NutchConfiguration.create(),
new DomainAnalysisJob(), args);
System.exit(res);
}
}
DomainAnalysisReducer.java
...
...
public class DomainAnalysisReducer extends
GoraReducer<Text, WebPage, String, WebPage> {
public static final Logger LOG = DomainAnalysisJob.LOG;
public DataStore<String, WebPage> datastore;
protected static float q1_ur_threshold = 500.0f;
protected static float q1_ur_docCount = 50;
public static final Utf8 Queue = new Utf8("q"); // Markers for Q1 and Q2
public static final Utf8 Q1 = new Utf8("q1");
public static final Utf8 Q2 = new Utf8("q2");
@Override
protected void setup(Context context) throws IOException,
InterruptedException {
Configuration conf = context.getConfiguration();
try {
datastore = StorageUtils.createWebStore(conf, String.class, WebPage.class);
}
catch (ClassNotFoundException e) {
throw new IOException(e);
}
q1_ur_threshold = conf.getFloat("domain.queue.threshold.bytes", 500.0f);
q1_ur_docCount = conf.getInt("domain.queue.doc.count", 50);
LOG.info("Conf updated: Queue-bytes-threshold = " + q1_ur_threshold + " Queue-doc-threshold: " + q1_ur_docCount);
}
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
datastore.close();
}
@Override
protected void reduce(Text key, Iterable<WebPage> values, Context context)
throws IOException, InterruptedException {
ArrayList<String> Cache = new ArrayList<String>();
int doc_counter = 0;
int total_ur_bytes = 0;
for ( WebPage page : values ) {
// cache
String orig_key = page.getMarkers().get( DomainAnalysisJob.URL_ORIG_KEY ).toString();
Cache.add(orig_key);
// do not consider those doc's that are not fetched or link URLs
if ( page.getStatus() == CrawlStatus.STATUS_UNFETCHED ) {
continue;
}
doc_counter++;
int ur_score_int = 0;
int doc_ur_bytes = 0;
int doc_total_bytes = 0;
String ur_score_str = "0";
String langInfo_str = null;
// read page and find its Urdu score
langInfo_str = TableUtil.toString(page.getLangInfo());
if (langInfo_str == null) {
continue;
}
ur_score_str = TableUtil.toString(page.getUrduScore());
ur_score_int = Integer.parseInt(ur_score_str);
doc_total_bytes = Integer.parseInt( langInfo_str.split("&")[0] );
doc_ur_bytes = ( doc_total_bytes * ur_score_int) / 100; //Formula to find ur percentage
total_ur_bytes += doc_ur_bytes;
}
float avg_bytes = 0;
float log10 = 0;
if ( doc_counter > 0 && total_ur_bytes > 0) {
avg_bytes = (float) total_ur_bytes/doc_counter;
log10 = (float) Math.log10(avg_bytes);
log10 = (Math.round(log10 * 100000f)/100000f);
}
context.getCounter("DomainAnalysis", "DomainCount").increment(1);
// if average bytes and doc count, are more than threshold then mark as q1
boolean mark = false;
if ( avg_bytes >= q1_ur_threshold && doc_counter >= q1_ur_docCount ) {
mark = true;
for ( int index = 0; index < Cache.size(); index++) {
String Orig_key = Cache.get(index);
float doc_score = log10;
WebPage page = datastore.get(Orig_key);
if ( page == null ) {
continue;
}
page.setScore(doc_score);
if (mark) {
page.getMarkers().put( Queue, Q1);
}
context.write(Orig_key, page);
}
}
}
In my testing and debugging, I have found that the statement WebPage page = datastore.get(Orig_key); is the major cause of the long run time. The job took about 16 hours to complete, but when I replaced this statement with WebPage page = WebPage.newBuilder().build(); the time was reduced to 6 hours. Is this due to I/O?
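Each datastore.get call is probably a separate read against the store, so one direction to explore is reusing the pages that already arrive in the reducer's values instead of fetching them again by key. The sketch below only illustrates that two-pass shape with a hypothetical Page class standing in for WebPage. It assumes a domain's pages fit in memory, and it copies each value because the framework may reuse the value object between iterations. The scoring is simplified to the log10 of the average byte count. Note that the copies would only carry the fields the mapper loaded.

```java
import java.util.ArrayList;
import java.util.List;

final class TwoPassSketch {
    /** Hypothetical stand-in for WebPage with only the fields this sketch needs. */
    static final class Page {
        final String key;
        final int urduBytes;
        float score;

        Page(String key, int urduBytes) {
            this.key = key;
            this.urduBytes = urduBytes;
        }

        Page copy() {
            Page p = new Page(key, urduBytes);
            p.score = score;
            return p;
        }
    }

    /** First pass copies the values and sums bytes; second pass sets the shared score. */
    static List<Page> scoreDomain(Iterable<Page> values) {
        List<Page> kept = new ArrayList<>();
        long totalBytes = 0;
        for (Page page : values) {
            kept.add(page.copy()); // copy: the incoming object may be reused
            totalBytes += page.urduBytes;
        }
        float score = 0f;
        if (!kept.isEmpty() && totalBytes > 0) {
            score = (float) Math.log10((double) totalBytes / kept.size());
        }
        for (Page page : kept) {
            page.score = score; // no per-key store lookup needed here
        }
        return kept;
    }
}
```

In the real reducer, the second loop would call context.write for each kept page. Whether this is faster in practice depends on the cluster and on how large the biggest domain is, so it would need to be measured.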

Why am I getting null as the destination?

When I have the program print out System.out.println(_spaces.get("classroom").toStringLong()); it spits back "classroom: a large lecture hall with a door that goes null to sidewalk." Why does it say a door that goes to null? I think I have to fix my _buildPortals method, but I'm not sure how.
public class ConfigLoader
{
private Ini _ini;
private HashMap<String, Space> _spaces = new HashMap<String, Space>();
private HashMap<String, Portal> _portals = new HashMap<String, Portal>();
private HashMap<String, Agent> _agents = new HashMap<String, Agent>();
public ConfigLoader(File iniFile)
{
_ini = new Ini(iniFile);
}
public Agent buildAll()
{
_buildSpaces();
_buildPortals();
_buildExits();
_buildDestinations();
System.out.println(_spaces.get("classroom").toStringLong());
_buildAgents();
//return _selectStartAgent();
return null;
}
private void _buildSpaces()
{
for(String spaceName : _ini.keys("spaces"))
{
String description = _ini.get("spaces", spaceName);
String image = _ini.get("images", "images");
Space spaceInstance = new Space(spaceName, description, null, image);
_spaces.put(spaceName, spaceInstance);
}
}
private void _buildPortals()
{
for(String portalName : _ini.keys("portals"))
{
String description = _ini.get("portal", portalName);
Portal portalInstance = new Portal(portalName, description, null);
_portals.put(portalName, portalInstance);
}
}
private void _buildExits()
{
for(String spaceName : _ini.keys("exits"))
{
String spaceExit = _ini.get("exits", spaceName);
Space space = _spaces.get(spaceName);
Portal exit = _portals.get(spaceExit);
space.setPortal(exit);
}
}
private void _buildDestinations()
{
for(String portalName : _ini.keys("destinations"))
{
String destination = _ini.get("destinations", portalName);
Space dest = _spaces.get(destination);
Portal portal = _portals.get(portalName);
if(dest == null)
{
System.out.println("Error");
System.exit(1);
}
else
{
portal.setDestination(dest);
}
}
}
private void _buildAgents()
{
for(String agentName : _ini.keys("agents"))
{
String agent = _ini.get("agents", agentName);
Space space = _spaces.get(agent);
if(space == null)
{
System.out.println("Error");
System.exit(1);
}
else
{
Agent a = new Agent(space, agentName);
_agents.put(agentName, a);
}
}
}
private Agent _selectStartAgent()
{
for(String agentName : _ini.keys("start"))
{
String agent = _ini.get("start", agentName);
Agent agentInstance = _agents.get(agent);
if(agent == null)
{
System.out.println("Error");
System.exit(1);
}
else
{
return agentInstance;
}
}
return null;
}
}

Following the other patterns in your code, maybe
String description = _ini.get("portal", portalName);
needs to be
String description = _ini.get("portals", portalName);
If so, it's usually a good idea to extract a name like this into a string constant,
private static final String PORTALS = "portals";
and use that constant in every place the section name is needed.
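To illustrate why the constant helps, here is a minimal sketch in which a nested Map stands in for the Ini class (an assumption, since the real Ini API is not shown in the question). SectionNames, get and loadPortalDescriptions are hypothetical names. A misspelled section name quietly yields null, while using one constant for both the key listing and the lookup keeps them in sync.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class SectionNames {
    static final String PORTALS = "portals";

    /** Looks up key in section; returns null when the section or key is missing. */
    static String get(Map<String, Map<String, String>> ini, String section, String key) {
        Map<String, String> entries = ini.get(section);
        return entries == null ? null : entries.get(key);
    }

    /** Reads every portal description, using the same constant for keys and values. */
    static Map<String, String> loadPortalDescriptions(Map<String, Map<String, String>> ini) {
        Map<String, String> descriptions = new HashMap<>();
        Map<String, String> portals =
                ini.getOrDefault(PORTALS, Collections.<String, String>emptyMap());
        for (String portalName : portals.keySet()) {
            descriptions.put(portalName, get(ini, PORTALS, portalName));
        }
        return descriptions;
    }
}
```

With a section "portals" containing door = "a door", get(ini, "portal", "door") returns null, which matches the symptom in the question, while get(ini, SectionNames.PORTALS, "door") returns the description.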

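The second line of your _buildSpaces method also looks wrong. It is meant to get the image associated with a given space, but it passes the literal string "images" as both arguments of the get call, so every space performs the same lookup. The second argument should presumably be spaceName.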

How to call another class in a JFrame form class?

I have a JFrame form class like this:
public class LoginForm extends javax.swing.JFrame
In this class, I get the username and password from the user and send them to a PHP server for validation; the response is either OK or Invalid User. I have another class, 'public class LoginTimer implements Runnable', which contains some code to execute. When LoginForm receives the OK response, I want control to move to the second class, LoginTimer; that is, the second class should be called. Please tell me how to do it.
=====================================================================
private void sendGet(String username,String pwd) throws Exception
{
String url = "http://localhost/login.php?username="+username+ "&password="+pwd;
final String USER_AGENT = "Mozilla/5.0";
URL obj = new URL(url);
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
con.setRequestMethod("GET");
con.setRequestProperty("User-Agent", USER_AGENT);
int responseCode = con.getResponseCode();
System.out.println("\nSending 'GET' request to URL : " + url);
System.out.println("Response Code : " + responseCode);
BufferedReader in = new BufferedReader(
new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuffer response = new StringBuffer();
while ((inputLine = in.readLine()) != null)
{
response.append(inputLine);
}
in.close();
//print result
String r=response.toString();
System.out.println("String "+r);
if(r.equals("OK"))
{
System.out.println("you are a valid user");
}
else
{
System.out.println("You are an invalid user");
}
}
Below is my code for the LoginTimer class. It collects the names of the visible windows, then a thread starts and its run() method calls sendGet() to send the window names to a PHP server page. I want LoginTimer to be called and executed automatically once LoginForm receives the OK response. In other words, once the user has logged in and been verified, sending window names to the PHP server should start automatically.
public class LoginTimer implements Runnable
{
LoginTimer lk1;
String s3;
static int arraySize=10;
static int arrayGrowth=2;
static String[] m=new String[arraySize];
static int count=0;
@Override
public void run()
{
for(int ck=0;ck<3;ck++)
{File file=new File("G:\\check.txt");
Scanner scanner = null;
try
{
scanner = new Scanner(file);
}
catch (FileNotFoundException ex)
{
Logger.getLogger(LoginTimer.class.getName()).log(Level.SEVERE, null, ex);
}
while(scanner.hasNext())
{
String[] tokens = scanner.nextLine().split(":");
String last = tokens[1];
// System.out.println(last);
if(last!=null)
{
try
{
lk1.sendGet(last,m);
}
catch (Exception ex)
{
Logger.getLogger(LoginTimer.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
try {
Thread.sleep(5000);
} catch (InterruptedException ex) {
Logger.getLogger(LoginTimer.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
public static void main(String[] args)
{
(new Thread(new LoginTimer())).start();
final List<WindowInfo> inflList=new ArrayList<WindowInfo>();
final List<Integer> order=new ArrayList<Integer>();
int top = User32.instance.GetTopWindow(0);
while (top!=0)
{
order.add(top);
top = User32.instance.GetWindow(top, User32.GW_HWNDNEXT);
}
User32.instance.EnumWindows(new WndEnumProc()
{
public boolean callback(int hWnd, int lParam)
{
if (User32.instance.IsWindowVisible(hWnd))
{
RECT r = new RECT();
User32.instance.GetWindowRect(hWnd, r);
if (r.left>-32000)
{ // not minimized (minimized windows report left = -32000)
byte[] buffer = new byte[1024];
User32.instance.GetWindowTextA(hWnd, buffer, buffer.length);
String title = Native.toString(buffer);
//lk1.getid(title);
if (m.length == count)
{
// expand list
m = Arrays.copyOf(m, m.length + arrayGrowth);
}
m[count]=Native.toString(buffer);
System.out.println("title===="+m[count]);
count++;
inflList.add(new WindowInfo(hWnd, r, title));
}
}
return true;
}
}, 0);
Collections.sort(inflList, new Comparator<WindowInfo>()
{
public int compare(WindowInfo o1, WindowInfo o2)
{
return order.indexOf(o1.hwnd)-order.indexOf(o2.hwnd);
}
});
for (WindowInfo w : inflList)
{
System.out.println(w);
}
}
public static interface WndEnumProc extends StdCallLibrary.StdCallCallback
{
boolean callback (int hWnd, int lParam);
}
public static interface User32 extends StdCallLibrary
{
final User32 instance = (User32) Native.loadLibrary ("user32", User32.class);
boolean EnumWindows (WndEnumProc wndenumproc, int lParam);
boolean IsWindowVisible(int hWnd);
int GetWindowRect(int hWnd, RECT r);
void GetWindowTextA(int hWnd, byte[] buffer, int buflen);
int GetTopWindow(int hWnd);
int GetWindow(int hWnd, int flag);
final int GW_HWNDNEXT = 2;
}
public static class RECT extends Structure
{
public int left,top,right,bottom;
}
public static class WindowInfo
{
int hwnd;
RECT rect;
String title;
public WindowInfo(int hwnd, RECT rect, String title)
{
this.hwnd = hwnd; this.rect = rect; this.title = title;
}
public String toString()
{
return String.format("(%d,%d)-(%d,%d) : \"%s\"",
rect.left ,rect.top,rect.right,rect.bottom,title);
}
}
public static void sendGet(String last1,String[] get) throws Exception
{
for(int t=0;t<get.length;t++)
{
if(get[t]!=null)
{
String url = "http://localhost/add_windows.php?username="+last1+"&windowname="+get[t];
final String USER_AGENT = "Mozilla/5.0";
URL obj = new URL(url);
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
con.setRequestMethod("GET");
con.setRequestProperty("User-Agent", USER_AGENT);
int responseCode = con.getResponseCode();
System.out.println("\nSending 'GET' request to URL : " + url);
System.out.println("Response Code : " + responseCode);
BufferedReader in = new BufferedReader(
new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuffer response = new StringBuffer();
while ((inputLine = in.readLine()) != null)
{
response.append(inputLine);
}
in.close();
String r=response.toString();
System.out.println("String "+r);
}
}
}
}

Because LoginTimer implements Runnable, its run() method can be executed on a separate thread. In the LoginForm class, after you get the OK result from the PHP page, create a LoginTimer object:
LoginTimer lt = new LoginTimer();
Runnable has no start() method of its own, so wrap the object in a Thread and start that:
new Thread(lt).start();
This calls your run() method on the new thread.
In your LoginTimer class, implement the run method like this:
class LoginTimer implements Runnable
{
public void run()
{
//put your code which you want to execute now ...
}
}

Your class implements java.lang.Runnable.
To have its run() method executed by a thread, pass an instance of the class to a Thread constructor and start the thread, something like:
Thread thread = new Thread(new LoginTimer());
thread.start();
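Putting the answers together, a minimal sketch of starting the Runnable only after an OK response might look like this. LoginTimerSketch, LoginFlowSketch, startTimerIfValid and loginAndWait are hypothetical names used for illustration; in the real code the check would sit in LoginForm's sendGet() where it currently prints "you are a valid user", and run() would contain the window-reporting work.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the question's LoginTimer.
class LoginTimerSketch implements Runnable {
    final AtomicBoolean ran = new AtomicBoolean(false);

    @Override
    public void run() {
        // The real class would read the file and call sendGet() here.
        ran.set(true);
    }
}

class LoginFlowSketch {
    // Starts the timer on a new thread only when the server answered "OK".
    // Returns the started thread, or null if the user was not validated.
    static Thread startTimerIfValid(String response, Runnable timer) {
        if ("OK".equals(response)) {
            Thread worker = new Thread(timer, "login-timer");
            worker.start();
            return worker;
        }
        return null;
    }

    // Demo helper: runs the flow, waits for the worker, reports whether run() executed.
    static boolean loginAndWait(String response) {
        LoginTimerSketch timer = new LoginTimerSketch();
        Thread worker = startTimerIfValid(response, timer);
        if (worker != null) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return timer.ran.get();
    }

    public static void main(String[] args) {
        System.out.println("OK response started timer: " + loginAndWait("OK"));
        System.out.println("Invalid response started timer: " + loginAndWait("Invalid User"));
    }
}
```

Writing the comparison as "OK".equals(response) also avoids a NullPointerException if the response is ever null. The join() call is only there so the demo can report the result; a login form would normally start the thread and carry on.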
