I'm a new French user on Stack Overflow and I have a problem. I'm using the HTML parser Jsoup to parse an HTML page. Parsing a single page works fine, but I can't parse several URLs at the same time.
This is my code:
First, the class that parses a web page:
package test2;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public final class Utils {
public static Map<String, String> parse(String url){
Map<String, String> out = new HashMap<String, String>();
try
{
Document doc = Jsoup.connect(url).get();
doc.select("img").remove();
Elements denomination = doc.select(".AmmDenomination");
Elements composition = doc.select(".AmmComposition");
Elements corptexte = doc.select(".AmmCorpTexte");
for(int i = 0; i < denomination.size(); i++)
{
out.put("denomination" + i, denomination.get(i).text());
}
for(int i = 0; i < composition.size(); i++)
{
out.put("composition" + i, composition.get(i).text());
}
for(int i = 0; i < corptexte.size(); i++)
{
out.put("corptexte" + i, corptexte.get(i).text());
System.out.println(corptexte.get(i));
}
} catch(IOException e){
e.printStackTrace();
}
return out;
}// End of parse method
public static void excelizer(int fileId, Map<String, String> values){
try
{
FileOutputStream out = new FileOutputStream("C:/Documents and Settings/c.bon/git/clinsearch/drugs/src/main/resources/META-INF/test/fichier2.xls" );
Workbook wb = new HSSFWorkbook();
Sheet mySheet = wb.createSheet();
Row row1 = mySheet.createRow(0);
Row row2 = mySheet.createRow(1);
String entete[] = {"CIS", "Denomination", "Composition", "Form pharma", "Indication therapeutiques", "Posologie", "Contre indication", "Mise en garde",
"Interraction", "Effet indesirable", "Surdosage", "Pharmacodinamie", "Liste excipients", "Incompatibilité", "Duree conservation",
"Conservation", "Emballage", "Utilisation Manipulation", "TitulaireAMM"};
for (int i = 0; i < entete.length; i++)
{
row1.createCell(i).setCellValue(entete[i]);
}
Set<String> set = values.keySet();
int rowIndexDenom = 1;
int rowIndexCompo = 1;
for(String key : set)
{
if(key.contains("denomination"))
{
mySheet.createRow(1).createCell(1).setCellValue(values.get(key));
rowIndexDenom++;
}
else if(key.contains("composition"))
{
row2.createCell(2).setCellValue(values.get(key));
rowIndexDenom++;
}
}
wb.write(out);
out.close();
}
catch(Exception e)
{
e.printStackTrace();
}
}
}
The second class:
package test2;
public final class Task extends Thread {
private static int fileId = 0;
private int id;
private String url;
public Task(String url)
{
this.url = url;
id = fileId;
fileId++;
}
@Override
public void run()
{
Utils.excelizer(id, Utils.parse(url));
}
}
The main class (entry point):
package test2;
import java.util.ArrayList;
public class Main {
public static void main(String[] args)
{
ArrayList<String> urls = new ArrayList<String>();
urls.add("http://base-donnees-publique.medicaments.gouv.fr/affichageDoc.php?specid=61266250&typedoc=R");
urls.add("http://base-donnees-publique.medicaments.gouv.fr/affichageDoc.php?specid=66207341&typedoc=R");
for(String url : urls)
{
new Task(url).run();
}
}
}
When the data is copied to my Excel file, the second URL doesn't work.
Can you help me solve my problem, please?
Thanks
I think it's because your main() exits before your second thread has a chance to do its job. You should wait for all spawned threads to complete using Thread.join(). Or better yet, create one of the ExecutorService implementations and use awaitTermination(...) to block until all URLs are parsed.
EDIT See some examples here http://www.javacodegeeks.com/2013/01/java-thread-pool-example-using-executors-and-threadpoolexecutor.html
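The ExecutorService approach can be sketched like this; the Runnables here just stand in for the question's Task objects, and the one-minute timeout is an arbitrary choice:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelRunner {
    // Submits every task to a fixed pool, then blocks until all of them finish.
    public static boolean runAll(List<Runnable> tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        tasks.forEach(pool::submit);
        pool.shutdown();                                   // accept no new tasks
        return pool.awaitTermination(1, TimeUnit.MINUTES); // wait for completion
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger done = new AtomicInteger();
        // In the question's code each Runnable would be a new Task(url)
        boolean finished = runAll(Arrays.asList(done::incrementAndGet, done::incrementAndGet));
        System.out.println(finished + ", tasks run: " + done.get());
    }
}
```

Unlike calling run() directly, the pool actually executes the tasks on worker threads, and main() does not return until awaitTermination says they are done.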
I'm comparing two Word documents manually, where I must not miss any strings, special characters, spaces, and so on, and each document is around 150 pages or more, so the comparison is a real headache. I have written a small Java program to compare the two documents, but I'm not able to list the missing words.
I'm using the Apache POI library.
Thanks in advance.
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
public class ReadDocFile {
private static XWPFDocument docx;
// private static String path = "C:\\States wise\\NH\\Assessment
// 2nd\\test.docx";
private static ArrayList<String> firstList = new ArrayList<String>(); // refers to first document list
private static ArrayList<String> secondList = new ArrayList<String>(); // refers to second document list
private static List<XWPFParagraph> paragraphList;
private static Map<String, String> map = null;
private static LinkedHashSet<String> firstMissedArray = new LinkedHashSet<String>(); // refers to first document Linked hash set
private static LinkedHashSet<String> secondMissedArray = new LinkedHashSet<String>(); // refers to second document Linked hash set
public static void getFilePath(String path) {
FileInputStream fis;
try {
fis = new FileInputStream(path);
docx = new XWPFDocument(fis);
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public static void get_First_Doc_Data() {
getFilePath("C:\\States wise\\NH\\Assessment 2nd\\test.docx");
paragraphList = docx.getParagraphs();
System.out.println("******************** first list Starts here ******************** ");
System.out.println();
for (int i = 0; i < paragraphList.size() - 1; i++) {
firstList.add(paragraphList.get(i).getText().toString());
System.out.println(firstList.get(i).toString());
}
System.out.println("*********** first list Ends here ********************");
}
public static void get_Second_Doc_Data() {
getFilePath("C:\\States wise\\NH\\Assessment 2nd\\test1.docx");
paragraphList = docx.getParagraphs();
System.out.println("******************** Second list Starts here ******************** ");
System.out.println();
for (int i = 0; i < paragraphList.size() - 1; i++) {
secondList.add(paragraphList.get(i).getText().toString());
System.out.println(secondList.get(i).toString());
}
System.out.println("*********** Second list Ends here ********************");
}
public static void main(String[] args) {
get_First_Doc_Data();
get_Second_Doc_Data();
//System.out.println("First Para: " + firstList.contains(secondList));
compare();
compare_Two_List();
}
private static void compare() {
String firstMiss = null;
//String secondMiss = null;
for (int i = 0; i < firstList.size(); i++) {
for (int j = 0; j < secondList.size(); j++) {
if (!firstList.get(i).toString().equals(secondList.get(i).toString())) {
firstMiss = firstList.get(i).toString();
//secondMiss = secondList.get(i).toString();
map = new HashMap<String, String>();
}
}
firstMissedArray.add(firstMiss);
//secondMissedArray.add(secondMiss);
// System.out.println(missedArray.get(i).toString());
}
}
private static void compare_Two_List() {
int num = 0;
map.clear();
Iterator<String> first = firstMissedArray.iterator();
//Iterator<String> second = secondMissedArray.iterator();
while (first.hasNext()) {
map.put(""+num, first.next());
num++;
}
System.out.println(firstMissedArray.size());
Iterator it = map.entrySet().iterator();
while (it.hasNext()) {
Map.Entry pair = (Map.Entry) it.next();
System.out.println(pair.getKey() + " = " + pair.getValue());
// it.remove(); // avoids a ConcurrentModificationException
}
}
}
I have taken the liberty of modifying your code to arrive at a solution to your problem. Please go through it.
This should pretty much solve your problem: put System.out.println statements wherever you think necessary, and tweak the flow of the program to achieve the checks your requirements call for. In my hurry I have not applied coding standards such as try/catch blocks for error handling of negative scenarios, so please take care of that when implementing it live.
If the documents are .PDF rather than .DOCX, use the Apache PDFBox API.
Here is the Code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
public class Comapre_Docs {
private static final String FIRST_DOC_PATH = "E:\\Workspace_Luna\\assignments\\Expected.docx";
private static final String SECOND_DOC_PATH = "E:\\Workspace_Luna\\assignments\\Actual.docx";
private static XWPFDocument docx;
private static List<XWPFParagraph> paragraphList;
private static ArrayList<String> firstList = new ArrayList<String>();
private static ArrayList<String> secondList = new ArrayList<String>();
public static void get_Doc_Data(String filePath, ArrayList listName)
throws IOException {
File file = new File(filePath);
FileInputStream fis = new FileInputStream(file);
docx = new XWPFDocument(fis);
paragraphList = docx.getParagraphs();
for (int i = 0; i <= paragraphList.size() - 1; i++) {
listName.add(paragraphList.get(i).getText().toString());
}
fis.close();
}
public static void main(String[] args) throws IOException {
get_Doc_Data(FIRST_DOC_PATH, firstList);
get_Doc_Data(SECOND_DOC_PATH, secondList);
compare(firstList, secondList);
}
private static void compare(ArrayList<String> firstList_1,
ArrayList<String> secondList_1) {
simpleCheck(firstList_1, secondList_1);
int size = firstList_1.size();
for (int i = 0; i < size; i++) {
paragraphCheck(firstList_1.get(i).toString().split(" "),
secondList_1.get(i).toString().split(" "), i);
}
}
private static void paragraphCheck(String[] firstParaArray,
String[] secondParaArray, int paraNumber) {
System.out
.println("=============================================================");
System.out.println("Paragraph No." + (paraNumber + 1) + ": Started");
if (firstParaArray.length != secondParaArray.length) {
System.out.println("There is mismatch of "
+ Math.abs(firstParaArray.length - secondParaArray.length)
+ " words in this paragraph");
}
TreeMap<String, Integer> firstDocPara = getOccurence(firstParaArray);
TreeMap<String, Integer> secondDocPara = getOccurence(secondParaArray);
ArrayList<String> keyData = new ArrayList<String>(firstDocPara.keySet());
for (int i = 0; i < keyData.size(); i++) {
if (firstDocPara.get(keyData.get(i)) != secondDocPara.get(keyData
.get(i))) {
System.out
.println("The following word is missing in actual document : "
+ keyData.get(i));
}
}
System.out.println("Paragraph No." + (paraNumber + 1) + ": Done");
System.out
.println("=============================================================");
}
private static TreeMap<String, Integer> getOccurence(String[] paraArray) {
TreeMap<String, Integer> paragraphStringCountHolder = new TreeMap<String, Integer>();
paragraphStringCountHolder.clear();
for (String a : paraArray) {
int count = 1;
if (paragraphStringCountHolder.containsKey(a)) {
count = paragraphStringCountHolder.get(a) + 1;
paragraphStringCountHolder.put(a, count);
} else {
paragraphStringCountHolder.put(a, count);
}
}
return paragraphStringCountHolder;
}
private static boolean simpleCheck(ArrayList<String> firstList,
ArrayList<String> secondList) {
boolean flag = false;
if (firstList.size() > secondList.size()) {
System.out
.println("There are more paragraph in Expected document than in Actual document");
} else if (firstList.size() < secondList.size()) {
System.out
.println("There are more paragraph in Actual document than in Expected document");
} else if (firstList.size() == secondList.size()) {
System.out.println("The paragraph count in both documents match");
flag = true;
}
return flag;
}
}
This is my first post, so sorry if I mess something up or am not clear enough. I have been looking through online forums for several hours and spent more time trying to figure it out myself.
I am reading information from a file, and I need a loop that creates a new ArrayList every time it goes through.
static ArrayList<String> fileToArrayList(String infoFromFile)
{
ArrayList<String> smallerArray = new ArrayList<String>();
//This ArrayList needs to be different every time so that I can add them
//all to the same ArrayList
if (infoFromFile != null)
{
String[] splitData = infoFromFile.split(":");
for (int i = 0; i < splitData.length; i++)
{
if (!(splitData[i] == null) || !(splitData[i].length() == 0))
{
smallerArray.add(splitData[i].trim());
}
}
}
return smallerArray;
}
The reason I need to do this is that I am creating an app for a school project that reads questions from a delimited text file. I have a loop earlier that reads one line at a time from the file; I will pass each line into this method.
How do I make the ArrayList smallerArray a separate ArrayList every time it goes through this method?
I need this so I can have an ArrayList of each of these ArrayLists.
Here is sample code for what you intend to do -
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;
public class SimpleFileReader {
private static final String DELEMETER = ":";
private String filename = null;
public SimpleFileReader() {
super();
}
public SimpleFileReader(String filename) {
super();
setFilename(filename);
}
public String getFilename() {
return filename;
}
public void setFilename(String filename) {
this.filename = filename;
}
public List<List<String>> getRowSet() throws IOException {
List<List<String>> rows = new ArrayList<>();
try (Stream<String> stream = Files.lines(Paths.get(filename))) {
stream.forEach(row -> rows.add(Arrays.asList(row.split(DELEMETER))));
}
return rows;
}
}
And here is the JUnit test for the above code -
import static org.junit.jupiter.api.Assertions.fail;
import java.io.IOException;
import java.util.List;
import org.junit.jupiter.api.Test;
public class SimpleFileReaderTest {
public SimpleFileReaderTest() {
super();
}
@Test
public void testFileReader() {
try {
SimpleFileReader reader = new SimpleFileReader("c:/temp/sample-input.txt");
List<List<String>> rows = reader.getRowSet();
int expectedValue = 3; // number of actual lines in the sample file
int actualValue = rows.size(); // number of rows in the list
if (actualValue != expectedValue) {
fail(String.format("Expected value for the row count is %d, whereas obtained value is %d", expectedValue, actualValue));
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
I have created a web scraper which pulls market data (share rates) from the website of the stock exchange, www.psx.com.pk. On that site there is a hyperlink, Market Summary; from that link I have to scrape the data. I have created the following program.
package com.market_summary;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.Locale;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ComMarket_summary {
boolean writeCSVToConsole = true;
boolean writeCSVToFile = true;
boolean sortTheList = true;
boolean writeToConsole;
boolean writeToFile;
public static Document doc = null;
public static Elements tbodyElements = null;
public static Elements elements = null;
public static Elements tdElements = null;
public static Elements trElement2 = null;
public static String Dcomma = ",";
public static String line = "";
public static ArrayList<Elements> sampleList = new ArrayList<Elements>();
public static void createConnection() throws IOException {
System.setProperty("http.proxyHost", "191.1.1.202");
System.setProperty("http.proxyPort", "8080");
String tempUrl = "http://www.psx.com.pk/index.php";
doc = Jsoup.connect(tempUrl).get();
System.out.println("Successfully Connected");
}
public static void parsingHTML() throws Exception {
File fold = new File("C:\\market_smry.csv");
fold.delete();
File fnew = new File("C:\\market_smry.csv");
for (Element table : doc.getElementsByTag("table")) {
for (Element trElement : table.getElementsByTag("tr")) {
trElement2 = trElement.getElementsByTag("td");
tdElements = trElement.getElementsByTag("td");
FileWriter sb = new FileWriter(fnew, true);
if (trElement.hasClass("marketData")) {
for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
if (it.hasNext()) {
sb.append("\r\n");
}
for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
Element tdElement2 = it.next();
final String content = tdElement2.text();
if (it2.hasNext()) {
sb.append(formatData(content));
sb.append(" | ");
}
}
System.out.println(sb.toString());
sb.flush();
sb.close();
}
}
System.out.println(sampleList.add(tdElements));
}
}
}
private static final SimpleDateFormat FORMATTER_MMM_d_yyyy = new SimpleDateFormat("MMM d, yyyy", Locale.US);
private static final SimpleDateFormat FORMATTER_dd_MMM_yyyy = new SimpleDateFormat("dd-MMM-yyyy", Locale.US);
public static String formatData(String text) {
String tmp = null;
try {
Date d = FORMATTER_MMM_d_yyyy.parse(text);
tmp = FORMATTER_dd_MMM_yyyy.format(d);
} catch (ParseException pe) {
tmp = text;
}
return tmp;
}
public static void main(String[] args) throws IOException, Exception {
createConnection();
parsingHTML();
}
}
Now, the problem: when I execute this program it should create a .csv file, but it doesn't create any file at all. When I debug the code I find that the program never enters the loop, and I don't understand why, because when I run the same program on another website with a slightly different page structure it runs fine.
My understanding is that the data is present inside a #document, which is a virtual element and doesn't mean anything, and that is why the program can't read it, while the other website has no such element. Kindly help me read the data inside the #document element.
Long Story Short
Change your tempUrl to http://www.psx.com.pk/phps/index1.php
Explanation
There is no table in the document at http://www.psx.com.pk/index.php.
Instead, it displays its content in two frames.
One is a dummy with the URL http://www.psx.com.pk/phps/blank.php.
The other is the real page showing the actual data, and its URL is
http://www.psx.com.pk/phps/index1.php
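With Jsoup you can also discover the frame URLs instead of hard-coding them. This is a sketch that parses a stand-in HTML string (the real code would fetch the page with Jsoup.connect(url).get()), and the frame[src] selector is an assumption about the page's markup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FrameFinder {
    public static void main(String[] args) {
        // Stand-in for the fetched index page, which contains only a frameset
        String html = "<frameset>"
                + "<frame src='phps/blank.php'>"
                + "<frame src='phps/index1.php'>"
                + "</frameset>";
        Document doc = Jsoup.parse(html, "http://www.psx.com.pk/");
        for (Element frame : doc.select("frame[src]")) {
            // abs:src resolves the relative src against the base URL given above
            System.out.println(frame.attr("abs:src"));
        }
    }
}
```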
How do you read an ORC file in Java? I want to read in a small file for some unit test output verification, but I can't find a solution.
I came across this and implemented one myself recently.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import java.util.List;
public class OrcFileDirectReaderExample {
public static void main(String[] argv)
{
try {
Reader reader = OrcFile.createReader(HdfsFactory.getFileSystem(), new Path("/user/hadoop/000000_0"));
StructObjectInspector inspector = (StructObjectInspector)reader.getObjectInspector();
System.out.println(reader.getMetadata());
RecordReader records = reader.rows();
Object row = null;
//These objects are the metadata for each column. They give you the type of each column and can parse it unless you
//want to parse each column yourself
List fields = inspector.getAllStructFieldRefs();
for(int i = 0; i < fields.size(); ++i) {
System.out.print(((StructField)fields.get(i)).getFieldObjectInspector().getTypeName() + '\t');
}
while(records.hasNext())
{
row = records.next(row);
List value_lst = inspector.getStructFieldsDataAsList(row);
StringBuilder builder = new StringBuilder();
//iterate over the fields
//Also fields can be null if a null was passed as the input field when processing wrote this file
for(Object field : value_lst) {
if(field != null)
builder.append(field.toString());
builder.append('\t');
}
//this writes out the row as it would be if this were a Text tab seperated file
System.out.println(builder.toString());
}
}catch (Exception e)
{
e.printStackTrace();
}
}
}
As per the Apache wiki, the ORC file format was introduced in Hive 0.11.
So you will need the Hive packages on your project's classpath to read ORC files. The packages for this are:
org.apache.hadoop.hive.ql.io.orc.Reader;
org.apache.hadoop.hive.ql.io.orc.OrcFile
A test case that reads ORC:
@Test
public void read_orc() throws Exception {
//todo do kerberos auth
String orcPath = "hdfs://user/hive/warehouse/demo.db/orc_path";
//load hdfs conf
Configuration conf = new Configuration();
conf.addResource(getClass().getResource("/hdfs-site.xml"));
conf.addResource(getClass().getResource("/core-site.xml"));
FileSystem fs = FileSystem.get(conf);
// custom read column
List<String> columns = Arrays.asList("id", "title");
final List<Map<String, Object>> maps = OrcUtil.readOrcFile(fs, orcPath, columns);
System.out.println(new Gson().toJson(maps));
}
An OrcUtil that reads an ORC path, selecting specific columns:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.ql.io.orc.OrcSplit;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
public class OrcUtil {
public static List<Map<String, Object>> readOrcFile(FileSystem fs, String orcPath, List<String> readColumns)
throws IOException, SerDeException {
JobConf jobConf = new JobConf();
for (Map.Entry<String, String> entry : fs.getConf()) {
jobConf.set(entry.getKey(), entry.getValue());
}
FileInputFormat.setInputPaths(jobConf, orcPath);
FileInputFormat.setInputPathFilter(jobConf, ((PathFilter) path1 -> true).getClass());
InputSplit[] splits = new OrcInputFormat().getSplits(jobConf, 1);
InputFormat<NullWritable, OrcStruct> orcInputFormat = new OrcInputFormat();
List<Map<String, Object>> rows = new ArrayList<>();
for (InputSplit split : splits) {
OrcSplit orcSplit = (OrcSplit) split;
System.out.printf("read orc split %s%n", ((OrcSplit) split).getPath());
StructObjectInspector inspector = getStructObjectInspector(orcSplit.getPath(), jobConf, fs);
List<? extends StructField> readFields = inspector.getAllStructFieldRefs()
.stream().filter(e -> readColumns.contains(e.getFieldName())).collect(Collectors.toList());
// 49B file is empty
if (orcSplit.getLength() > 49) {
RecordReader<NullWritable, OrcStruct> recordReader = orcInputFormat.getRecordReader(orcSplit, jobConf, Reporter.NULL);
NullWritable key = recordReader.createKey();
OrcStruct value = recordReader.createValue();
while (recordReader.next(key, value)) {
Map<String, Object> entity = new HashMap<>();
for (StructField field : readFields) {
entity.put(field.getFieldName(), inspector.getStructFieldData(value, field));
}
rows.add(entity);
}
}
}
return rows;
}
private static StructObjectInspector getStructObjectInspector(Path path, JobConf jobConf, FileSystem fs)
throws IOException, SerDeException {
OrcFile.ReaderOptions readerOptions = OrcFile.readerOptions(jobConf);
readerOptions.filesystem(fs);
Reader reader = OrcFile.createReader(path, readerOptions);
String typeStruct = reader.getObjectInspector().getTypeName();
System.out.println(typeStruct);
List<String> columnList = parseColumnAndType(typeStruct);
String[] fullColNames = new String[columnList.size()];
String[] fullColTypes = new String[columnList.size()];
for (int i = 0; i < columnList.size(); ++i) {
String[] temp = columnList.get(i).split(":");
fullColNames[i] = temp[0];
fullColTypes[i] = temp[1];
}
Properties p = new Properties();
p.setProperty("columns", StringUtils.join(fullColNames, ","));
p.setProperty("columns.types", StringUtils.join(fullColTypes, ":"));
OrcSerde orcSerde = new OrcSerde();
orcSerde.initialize(jobConf, p);
return (StructObjectInspector) orcSerde.getObjectInspector();
}
private static List<String> parseColumnAndType(String typeStruct) {
int startIndex = typeStruct.indexOf("<") + 1;
int endIndex = typeStruct.lastIndexOf(">");
typeStruct = typeStruct.substring(startIndex, endIndex);
List<String> columnList = new ArrayList<>();
List<String> splitList = Arrays.asList(typeStruct.split(","));
Iterator<String> it = splitList.iterator();
while (it.hasNext()) {
StringBuilder current = new StringBuilder(it.next());
String currentStr = current.toString();
boolean left = currentStr.contains("(");
boolean right = currentStr.contains(")");
if (!left && !right) {
columnList.add(currentStr);
continue;
}
if (left && right) {
columnList.add(currentStr);
continue;
}
if (left && !right) {
while (it.hasNext()) {
String next = it.next();
current.append(",").append(next);
if (next.contains(")")) {
break;
}
}
columnList.add(current.toString());
}
}
return columnList;
}
}
Try this for getting the ORC file's row count:
private long getRowCount(FileSystem fs, String fName) throws Exception {
long tempCount = 0;
Reader rdr = OrcFile.createReader(fs, new Path(fName));
StructObjectInspector insp = (StructObjectInspector) rdr.getObjectInspector();
Iterable<StripeInformation> iterable = rdr.getStripes();
for(StripeInformation stripe:iterable){
tempCount = tempCount + stripe.getNumberOfRows();
}
return tempCount;
}
//fName is hdfs path to file.
long rowCount = getRowCount(fs,fName);
I started reading about fork/join in Java; my goal is to check which JavaMail messages have an attachment.
I'm not sure my approach is right, because I've seen tutorials that split the array into two parts in each loop. I'm posting my whole class below to demonstrate what I did.
package service.forkinjoin;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.RecursiveTask;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.Multipart;
import javax.mail.Part;
import javax.mail.internet.MimeBodyPart;
import service.FileUtil;
public class SepareteMessagesProcess extends RecursiveTask<List<Message>> {
private static final long serialVersionUID = 7126215365819834781L;
private List<Message> listMessages;
private static List<Message> listMessagesToDelete = new ArrayList<>();
private final int threadCount = Runtime.getRuntime().availableProcessors();
public SepareteMessagesProcess(List<Message> listMessages) {
this.listMessages = listMessages;
}
@Override
protected List<Message> compute() {
if (this.listMessages.size() <= threadCount) {
try {
this.separateMessages(this.listMessages);
} catch (MessagingException | IOException e) {
e.printStackTrace();
}
} else {
int[] arrayMaxIndex = this.generateIndexs(this.listMessages);
List<SepareteMessagesProcess> list = this.splitList(this.listMessages, arrayMaxIndex);
invokeAll(list);
}
return listMessagesToDelete;
}
private void separateMessages(List<Message> listMessages) throws MessagingException, IOException {
for (Iterator<Message> iterator = listMessages.iterator(); iterator.hasNext();) {
Message message = (Message) iterator.next();
if ((this.hasNoAttachment(message.getContentType()) || (this.hasNoXmlAttachment(message)))) {
listMessagesToDelete.add(message);
}
}
}
private List<SepareteMessagesProcess> splitList(List<Message> listMessages, int[] arrayMaxIndex) {
List<SepareteMessagesProcess> list = new ArrayList<>(this.threadCount);
int end = 0;
int start = 0;
for (int i = 0; i < this.threadCount; i++) {
end += (arrayMaxIndex[i]);
list.add(new SepareteMessagesProcess(listMessages.subList(start, end)));
start += arrayMaxIndex[i];
}
return list;
}
private int[] generateIndexs(List<Message> listMessages) {
int value = listMessages.size() / this.threadCount;
int[] arrayMaxIndex = new int[this.threadCount];
for (int i = 0; i < this.threadCount; i++) {
arrayMaxIndex[i] = value;
}
arrayMaxIndex[this.threadCount - 1] += listMessages.size() % threadCount;
return arrayMaxIndex;
}
private boolean hasNoAttachment(String content) {
return !content.contains("multipart/MIXED");
}
private boolean hasNoXmlAttachment(Message message) throws IOException, MessagingException {
Multipart multipart = (Multipart) message.getContent();
for (int i = 0; i < multipart.getCount(); i++) {
MimeBodyPart mimeBodyPart = (MimeBodyPart) multipart.getBodyPart(i);
if (Part.ATTACHMENT.equalsIgnoreCase(mimeBodyPart.getDisposition())) {
if (FileUtil.isXmlFile(mimeBodyPart.getFileName())) {
return false;
}
}
}
return true;
}
}
No. Rather than rewrite the tutorial here, just follow the example it gives, such as:
if (problem is small)
directly solve problem
else {
split problem into independent parts
fork new subtasks to solve each part
join all subtasks
compose result from subresults
}
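As a generic illustration of that skeleton (an array sum, not the JavaMail case; the threshold value is arbitrary):

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // "problem is small" cutoff
    private final long[] data;
    private final int start, end;

    public SumTask(long[] data, int start, int end) {
        this.data = data;
        this.start = start;
        this.end = end;
    }

    @Override
    protected Long compute() {
        if (end - start <= THRESHOLD) {          // directly solve small problems
            long sum = 0;
            for (int i = start; i < end; i++) sum += data[i];
            return sum;
        }
        int mid = (start + end) >>> 1;           // split into independent parts
        SumTask left = new SumTask(data, start, mid);
        SumTask right = new SumTask(data, mid, end);
        left.fork();                             // fork one subtask
        long rightResult = right.compute();      // compute the other directly
        return left.join() + rightResult;        // join and compose the result
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        Arrays.fill(data, 1L);
        long total = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(total); // prints 10000
    }
}
```

Note that it forks only one half and computes the other in the current thread, which avoids wasting the calling worker while it waits.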