Search and replace a string in 10000 files - java

EDIT: I have 10,000 identical files with the below text in them.
TRANSACTION_ID=9093626660000000001,VAULT_REPORT_NAME=VUpldr_QA_1Mb_Report00001,DIMENSION=REGION:Europe;LOB:All LOB;CATEGORY:RO Reporting;CUSTOMER:All Customer;FREQUENCY:Daily;REPORT AUDIENCE:Apple RO Reporting;REPORT SUBSCRIPTION:Apple RO Reporting
My requirement is to replace the values as below across the 10,000 files:
TRANSACTION_ID from 9093626660000000001 to 9093626660000010000,
and
VAULT_REPORT_NAME from VUpldr_QA_1Mb_Report00001 to
VUpldr_QA_1Mb_Report10000.
So my output files' contents would be:
1st file:
TRANSACTION_ID=9093626660000000001,VAULT_REPORT_NAME=VUpldr_QA_1Mb_Report00001,DIMENSION=REGION:Europe;LOB:All LOB;CATEGORY:RO Reporting;CUSTOMER:All Customer;FREQUENCY:Daily;REPORT AUDIENCE:Apple RO Reporting;REPORT SUBSCRIPTION:Apple RO Reporting
2nd file:
TRANSACTION_ID=9093626660000000002,VAULT_REPORT_NAME=VUpldr_QA_1Mb_Report00002,DIMENSION=REGION:Europe;LOB:All LOB;CATEGORY:RO Reporting;CUSTOMER:All Customer;FREQUENCY:Daily;REPORT AUDIENCE:Apple RO Reporting;REPORT SUBSCRIPTION:Apple RO Reporting
10000th file:
TRANSACTION_ID=9093626660000010000,VAULT_REPORT_NAME=VUpldr_QA_1Mb_Report10000,DIMENSION=REGION:Europe;LOB:All LOB;CATEGORY:RO Reporting;CUSTOMER:All Customer;FREQUENCY:Daily;REPORT AUDIENCE:Apple RO Reporting;REPORT SUBSCRIPTION:Apple RO Reporting
I wrote the below code, but it isn't working:
import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class mdScript {
    public static void main(String[] args) throws IOException {
        for (int i = 1; i <= 10000; i++) {
            StringBuffer pathVarBuf = new StringBuffer();
            pathVarBuf.append("/Users/564169/Desktop/Vault_Testing/vUpldr/MD/1MbReport");
            pathVarBuf.append(i);
            pathVarBuf.append(".md");
            //System.out.println(pathVarBuf);
            Path path = Paths.get(pathVarBuf.toString());
            Charset charset = StandardCharsets.UTF_8;
            String content = new String(Files.readAllBytes(path), charset);
            content = content.replaceAll("VUpldr_QA_1Mb_Report00001", "RVUpldr_QA_1Mb_Report" + i);
            int id = 999900001; // (Transaction id in the org MD file is 90)
            id = id + i;
            content = content.replaceAll("999900001", Integer.toString(id));
            Files.write(path, content.getBytes(charset));
            System.out.println(content);
        }
    }
}
I am getting the below error:
Exception in thread "main" java.lang.NullPointerException
at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:267)
at java.nio.file.Paths.get(Paths.java:84)
at mdScript.main(mdScript.java:22)

There are many ways to do this; you can do it programmatically as well.
A simple solution would be using TextPad.
Here is a link for the same:
TextPad find and replace in files
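If you prefer the programmatic route, here is a minimal sketch of a corrected version of the loop from the question. Note three issues in the original code: the replacement string has a stray leading R (RVUpldr_QA_1Mb_Report), the search string 999900001 never occurs in the text shown, and "Report" + i would produce Report1 rather than Report00001, so the counter needs zero-padding. This sketch assumes the file layout from the question and is untested against your data:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MdScriptFixed {
    public static void main(String[] args) throws Exception {
        long baseId = 9093626660000000000L; // assumed: the original id 9093626660000000001 minus 1
        for (int i = 1; i <= 10000; i++) {
            Path path = Paths.get("/Users/564169/Desktop/Vault_Testing/vUpldr/MD/1MbReport" + i + ".md");
            if (!Files.exists(path)) {
                System.err.println("Missing file: " + path); // a missing file would otherwise abort the run
                continue;
            }
            String content = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
            // Replace the literal placeholders that actually occur in the template.
            content = content.replace("9093626660000000001", Long.toString(baseId + i));
            content = content.replace("VUpldr_QA_1Mb_Report00001",
                    String.format("VUpldr_QA_1Mb_Report%05d", i)); // zero-padded counter
            Files.write(path, content.getBytes(StandardCharsets.UTF_8));
        }
    }
}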

Related

Unexpected record type (org.apache.poi.hssf.record.HyperlinkRecord)

The problem:
I'm just trying to open a .xls file using the Apache POI 4.1.0 library, and it gives the same error as a similar question from 4 years ago.
I already tried
versions 3.12-3.16,
and 3.13 as well.
All versions can open a blank .xls, or one filled in by myself, but not this one.
This document is generated automatically, and I need to make a program that accepts it.
I already made a .NET Standard library in C# which works. I tried to use Xamarin Android, and it's a horror: the app weighs 50 MB vs 3 MB due to various terrible SDK link errors, but that's a different story. So I decided to do it in Kotlin.
The code is from the documentation.
You can check the file on git.
val inputStream = FileInputStream("./test.xls")
val wb = HSSFWorkbook(inputStream)
I expect no errors while opening the .xls.
The actual output is:
Exception in thread "main" java.lang.RuntimeException: Unexpected record type (org.apache.poi.hssf.record.HyperlinkRecord)
at org.apache.poi.hssf.record.aggregates.RowRecordsAggregate.<init>(RowRecordsAggregate.java:97)
at org.apache.poi.hssf.model.InternalSheet.<init>(InternalSheet.java:183)
at org.apache.poi.hssf.model.InternalSheet.createSheet(InternalSheet.java:122)
at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:354)
at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:400)
at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:381)
at ru.plumber71.toolbox.ExcelParcerKt.main(ExcelParcer.kt:19)
at ru.plumber71.toolbox.ExcelParcerKt.main(ExcelParcer.kt)
The document will not be modified in any way. If there any other libraries to just read the dataset or strings from the .xls file will be OK.
After some investigation I found the problem with your test.xls file.
According to the file format specifications, all HyperlinkRecords should be together in the Hyperlink Table, which is contained in the Sheet Substream following the cell records. In your case the HyperlinkRecords are between other records (between NumberRecords and LabelSSTRecords, in this case). So I suspect it was not Excel that created that test.xls file.
Excel might be tolerant enough to open that file nevertheless. But you cannot expect that Apache POI also tries to tolerate all possible violations of the file format. If you open the file using Excel and then re-save it, Apache POI is able to create the Workbook after that.
Apache POI is not able to repair this the way Excel can. But one could read the POIFSFileSystem in a low-level way and filter out the HyperlinkRecords that sit between other records. That way one could read the content using Apache POI, except of course for the hyperlinks.
Example:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.poifs.filesystem.DirectoryNode;
import org.apache.poi.hssf.record.Record;
import org.apache.poi.hssf.record.NameRecord;
import org.apache.poi.hssf.record.NameCommentRecord;
import org.apache.poi.hssf.record.HyperlinkRecord;
import org.apache.poi.hssf.record.RecordFactoryInputStream;
import org.apache.poi.hssf.record.RecordFactory;
import org.apache.poi.hssf.model.RecordStream;
import org.apache.poi.hssf.model.InternalWorkbook;
import org.apache.poi.hssf.model.InternalSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFName;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.util.CellReference;
import java.util.List;
import java.util.ArrayList;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Constructor;

class ExcelOpenHSSF {
    public static void main(String[] args) throws Exception {
        String fileName = "test(2).xls";
        try (InputStream is = new FileInputStream(fileName);
             POIFSFileSystem fileSystem = new POIFSFileSystem(is)) {

            // find workbook directory entry
            DirectoryNode directory = fileSystem.getRoot();
            String workbookName = "";
            for (String wbName : InternalWorkbook.WORKBOOK_DIR_ENTRY_NAMES) {
                if (directory.hasEntry(wbName)) {
                    workbookName = wbName;
                    break;
                }
            }
            InputStream stream = directory.createDocumentInputStream(workbookName);

            // loop over all records and manipulate if needed
            List<Record> records = new ArrayList<Record>();
            RecordFactoryInputStream recStream = new RecordFactoryInputStream(stream, true);

            // here we filter out the HyperlinkRecords that are between other records
            // (NumberRecords and LabelSSTRecords in that case)
            // System.out.println prints the problematic records
            Record record1 = null;
            Record record2 = null;
            while ((record1 = recStream.nextRecord()) != null) {
                record2 = recStream.nextRecord();
                if (!(record1 instanceof HyperlinkRecord) && (record2 instanceof HyperlinkRecord)) {
                    System.out.println(record1);
                    System.out.println(record2);
                    records.add(record1);
                } else if ((record1 instanceof HyperlinkRecord) && !(record2 instanceof HyperlinkRecord)) {
                    System.out.println(record1);
                    System.out.println(record2);
                    records.add(record2);
                } else {
                    records.add(record1);
                    if (record2 != null) records.add(record2);
                }
            }

            // now create the HSSFWorkbook
            // see https://svn.apache.org/viewvc/poi/tags/REL_4_1_0/src/java/org/apache/poi/hssf/usermodel/HSSFWorkbook.java?view=markup#l322
            InternalWorkbook internalWorkbook = InternalWorkbook.createWorkbook(records);
            HSSFWorkbook wb = HSSFWorkbook.create(internalWorkbook);
            int recOffset = internalWorkbook.getNumRecords();
            Method convertLabelRecords = HSSFWorkbook.class.getDeclaredMethod("convertLabelRecords", List.class, int.class);
            convertLabelRecords.setAccessible(true);
            convertLabelRecords.invoke(wb, records, recOffset);
            RecordStream rs = new RecordStream(records, recOffset);
            while (rs.hasNext()) {
                InternalSheet internalSheet = InternalSheet.createSheet(rs);
                Constructor constructor = HSSFSheet.class.getDeclaredConstructor(HSSFWorkbook.class, InternalSheet.class);
                constructor.setAccessible(true);
                HSSFSheet hssfSheet = (HSSFSheet) constructor.newInstance(wb, internalSheet);
                Field _sheets = HSSFWorkbook.class.getDeclaredField("_sheets");
                _sheets.setAccessible(true);
                @SuppressWarnings("unchecked")
                List<HSSFSheet> sheets = (ArrayList<HSSFSheet>) _sheets.get(wb);
                sheets.add(hssfSheet);
            }
            for (int i = 0; i < internalWorkbook.getNumNames(); ++i) {
                NameRecord nameRecord = internalWorkbook.getNameRecord(i);
                Constructor constructor = HSSFName.class.getDeclaredConstructor(HSSFWorkbook.class, NameRecord.class, NameCommentRecord.class);
                constructor.setAccessible(true);
                HSSFName name = (HSSFName) constructor.newInstance(wb, nameRecord, internalWorkbook.getNameCommentRecord(nameRecord));
                Field _names = HSSFWorkbook.class.getDeclaredField("names");
                _names.setAccessible(true);
                @SuppressWarnings("unchecked")
                List<HSSFName> names = (ArrayList<HSSFName>) _names.get(wb);
                names.add(name);
            }

            // now the workbook is created properly
            System.out.println(wb);

            /*
            // getting the data
            DataFormatter formatter = new DataFormatter();
            Sheet sheet = wb.getSheetAt(0);
            for (Row row : sheet) {
                for (Cell cell : row) {
                    CellReference cellRef = new CellReference(row.getRowNum(), cell.getColumnIndex());
                    System.out.print(cellRef.formatAsString());
                    System.out.print(" - ");
                    String text = formatter.formatCellValue(cell);
                    System.out.println(text);
                }
            }
            */
        }
    }
}
I was able to open a file of this "corrupted" type by using the JExcel API.
Apache POI also opens the file if you manually re-save it using the Excel application (that may not be suitable for everyone).
Sorry for asking strange questions. Thank you all, and I hope someone may find this useful.
val inputStream = FileInputStream("./testCorrupted.xls")
val workbook = Workbook.getWorkbook(inputStream)
val sheet = workbook.getSheet(0)
val cell1 = sheet.getCell(0, 0)
print(cell1.contents + ":")

Delete specific rownumbers in .csv file using java

For example: I am trying to search for the text "abc" in a .csv file; it is present in column no. 6 in multiple rows, and I need to delete those rows.
I tried the below code. I am able to get the line/row number where the text "abc" is present in column 6, but it is not deleting the rows.
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;

public class ReadExcel {
    public static void main(String[] args) throws Exception {
        String csvFile = "csv filelocation";
        CSVReader reader = new CSVReader(new FileReader(csvFile));
        List<String[]> allElements = reader.readAll();
        String[] nextLine;
        int lineNumber = 0;
        while ((nextLine = reader.readNext()) != null) {
            lineNumber++;
            if (nextLine[5].equalsIgnoreCase("abc")) {
                System.out.println("Line # " + lineNumber);
                allElements.remove(lineNumber);
            }
        }
    }
}
For reading files in CSV format, I am currently using the library Super CSV. There are various examples.
Let me know if you need help using it.
If you would like to stay with the opencsv library, here is a new example for writing the new content to a CSV file, taking inspiration from your example code.
List<String[]> allElements; /* This list will contain the lines that cover your criteria */
/*
...
*/
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"));
writer.writeAll(allElements);
writer.close();
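For completeness, here is a minimal sketch of the whole read-filter-write cycle with opencsv; the names input.csv and output.csv are illustrative. One note on your original code: reader.readAll() already consumes the whole stream, so the subsequent readNext() loop never sees any lines. Filtering while reading avoids both that and the index bookkeeping of remove():

import java.io.FileReader;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;
import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;

public class RemoveMatchingRows {
    public static void main(String[] args) throws Exception {
        List<String[]> kept = new ArrayList<>();
        try (CSVReader reader = new CSVReader(new FileReader("input.csv"))) {
            String[] line;
            while ((line = reader.readNext()) != null) {
                // Keep only rows whose column 6 (index 5) is not "abc".
                if (line.length < 6 || !line[5].equalsIgnoreCase("abc")) {
                    kept.add(line);
                }
            }
        }
        try (CSVWriter writer = new CSVWriter(new FileWriter("output.csv"))) {
            writer.writeAll(kept);
        }
    }
}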

OCR implementation using Java

I have written Java code to convert images into text. But my code takes only a single image as input. I want the program to fetch images from a folder and then run OCR on them.
My code is:
import java.io.File;
import java.io.FileOutputStream;
import org.bytedeco.javacpp.*;
import org.junit.Test;
import static org.bytedeco.javacpp.lept.*;
import static org.bytedeco.javacpp.tesseract.*;
import static org.junit.Assert.assertTrue;

public class BasicTesseractExampleTest {
    @Test
    public void givenTessBaseApi_whenImageOcrd_thenTextDisplayed() throws Exception {
        BytePointer outText;
        TessBaseAPI api = new TessBaseAPI();
        // Initialize tesseract-ocr with English, without specifying tessdata path
        if (api.Init(".", "ENG") != 0) {
            System.err.println("Could not initialize tesseract.");
            System.exit(1);
        }
        PIX image = pixRead("IMG_0012 (1).jpg");
        api.SetImage(image);
        // Get OCR result
        outText = api.GetUTF8Text();
        String string = outText.getString();
        assertTrue(!string.isEmpty());
        System.out.println(string);
        // Destroy used object and release memory
        api.End();
        outText.deallocate();
        pixDestroy(image);
    }
}
To read the list of files in a given path, use for example:
File f = new File("C:/programs");
File[] fileArray = f.listFiles();
Now you can check every File out of the fileArray; if it is a directory, skip it with:
if(fileArray[0].isDirectory()) continue;
To find the images you can check, for example, the ending of the filename with:
fileArray[0].getName().endsWith(".jpg")
Do this check for all files out of the fileArray and call your method with the right files. To pass in the right file you have to change this line of your code:
PIX image = pixRead("IMG_0012 (1).jpg");
and use fileArray[?], where the ? must be replaced with the right index.
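Putting it together, a minimal sketch of the folder loop; the directory path is just an example:

import java.io.File;

public class OcrFolder {
    public static void main(String[] args) {
        File dir = new File("C:/programs"); // example image folder
        File[] fileArray = dir.listFiles();
        if (fileArray == null) return; // not a directory, or an I/O error
        for (File file : fileArray) {
            if (file.isDirectory()) continue;                             // skip subdirectories
            if (!file.getName().toLowerCase().endsWith(".jpg")) continue; // images only
            // Hand the path to the OCR code from the question, e.g.:
            // PIX image = pixRead(file.getAbsolutePath());
            System.out.println("Would OCR: " + file.getAbsolutePath());
        }
    }
}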

Get MIME type from dicom files in java

I have tried all the following:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.file.Files;

public class mimeDicom {
    public static void main(String[] argvs) throws IOException {
        String path = "Image003.dcm";
        String[] mime = new String[3];
        File file = new File(path);
        mime[0] = Files.probeContentType(file.toPath());
        mime[1] = URLConnection.guessContentTypeFromName(file.getName());
        InputStream is = new BufferedInputStream(new FileInputStream(file));
        mime[2] = URLConnection.guessContentTypeFromStream(is);
        for (String m : mime)
            System.out.println("mime: " + m);
    }
}
But the results are still mime: null for each of the methods tried above, and I really want to know if a file is a DICOM, as sometimes these files don't have an extension or have a different one.
How can I tell whether the file is a DICOM from the path?
Note: this is not a duplicate of "How to accurately determine mime data from a file?" because the excellent list of magic numbers there doesn't cover DICOM files, and Apache Tika returns application/octet-stream, which doesn't really identify the file as an image and isn't useful, since NIfTI files (among others) get exactly the same MIME type from Tika.
To determine if a file is DICOM, your best bet is to parse the file yourself and see if it contains the magic bytes "DICM" at file offset 128.
The first 128 bytes are usually 0 but may contain anything.
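A minimal sketch of that check (the helper name isDicom is just illustrative); if the magic matches, application/dicom is the registered MIME type:

import java.io.IOException;
import java.io.RandomAccessFile;

public class DicomCheck {
    // Returns true if the file carries the "DICM" magic at offset 128.
    static boolean isDicom(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            if (raf.length() < 132) return false; // too short for preamble + magic
            raf.seek(128);                        // skip the 128-byte preamble
            byte[] magic = new byte[4];
            raf.readFully(magic);
            return magic[0] == 'D' && magic[1] == 'I' && magic[2] == 'C' && magic[3] == 'M';
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isDicom("Image003.dcm") ? "application/dicom" : "unknown");
    }
}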

(Java) Finding a substring in a string which is in UTF-8 encoded format

Say we have a main string containing some text in UTF-8, and another string containing a word, also in UTF-8. Please help me find the word in the text in Java. Thank you.
import java.awt.Component;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import javax.swing.JFileChooser;

public class Example {
    private static Component frame;

    public static void main(String args[]) throws FileNotFoundException, IOException {
        JFileChooser fc = new JFileChooser();
        int returnVal = fc.showOpenDialog(frame); // Where frame is the parent component
        File file = null;
        if (returnVal == JFileChooser.APPROVE_OPTION) {
            file = fc.getSelectedFile();
            // Now you have your file to do whatever you want to do
            String str = file.getName();
            str = "c:\\" + str;
            BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(str), "UTF8"));
            String line;
            String wordfname = "c:\\word.txt";
            BufferedReader innew = new BufferedReader(new InputStreamReader(new FileInputStream(wordfname), "UTF8"));
            String word;
            word = innew.readLine();
            System.out.println(word);
            File fileDir = new File("c:\\test.txt");
            Writer out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileDir), "UTF8"));
            while ((line = in.readLine()) != null) {
                System.out.println(line);
                out.append(line).append("\r\n");
                boolean r = line.contains(word);
                System.out.println(r);
            }
            out.flush();
            out.close();
            System.out.println(str);
        } else {
            // User did not choose a valid file
        }
    }
}
Link to the two files are: https://www.dropbox.com/s/4ej0hii6gnlwtga/kannada.txt and https://www.dropbox.com/s/emncfr7bsi8mvwn/word.txt
In fact you did everything fine, apart from some UTF-8 details.
Java Reader/Writer/String handle Unicode.
(Please close the readers too; a flush before close is not needed.)
There is one thing: zero-width combining diacritical marks. Small c-circumflex, ĉ, is one character in the Unicode table, code point U+0109, Java "\u0109", but it can also be two Unicode code points: c plus a zero-width circumflex, "c\u0302".
There exists text normalization in Java which transforms a string into a specific form.
String cCircumflex = "\u0109"; // c^
String cWithCircumflex = "c\u0302"; // c^
String cx = Normalizer.normalize(cCircumflex, Normalizer.Form.NFKC);
String cx2 = Normalizer.normalize(cWithCircumflex, Normalizer.Form.NFKC);
assert cx.equals(cx2);
Which normalization to choose is more or less irrelevant; composition (...C) seems most natural (and gives better font rendering), but decomposition (...D) allows natural sorting to be "aäá...cĉ...eé...".
You could even search for words with the diacritical marks removed (cafe versus café):
word = Normalizer.normalize(word, Normalizer.Form.NFKD); // Decompose.
word = word.replaceAll("\\p{M}", ""); // Remove diacriticals.
word = word.replaceAll("\\p{C}", ""); // Optional: invisible control characters.
After running the original code:
It seems to work for me without any change (Java 8), though I had to put kannada.txt on C:\.
ಅದರಲ್ಲಿ
್ರಪಂಚದಲ್ಲಿ ಅನೇಕ ಮಾಧ್ಯಮಗಳು ಇದೆ. ಆಕಾಶವಾಣಿ, ದೂರದರ್ಶನ, ವಾರ್ತಾ ಪತ್ರಿಕೆ ಮುಂತಾದವು ಅದರಲ್ಲಿ ದೂರದರ್ಶನಪ ಪ್ರಮುಖವಾದ ಕಾರ್ಯವನ್ನು ಹೊಂದಿದ್ದು ಅದನ್ನು ಚಿಕ್ಕವರಿಂದ ಹಿಡಿದು ದೊಡ್ಡವರವರೆಗೂ ನೋಡುತ್ತಾರೆ. ಇದಕ್ಕೆ ಇಂಗ್ಲೀಷ್‌ನಲ್ಲಿ ಟೆಲಿವಿಷನ್ ಎಂದು ಚಿಕ್ಕದಾಗಿ ಟಿ.ವಿ. ಎಂದು ಕರೆಯುವ ಬದಲು ಟಿ.ಕೆ. ಎಂದು ಕರೆಯಬೇಕಾಗಿತ್ತು. ಏಕೆಂದರೆ ಇದು ಟೆಲಿವಿಷನ್ ಅಷ್ಟೇ ಅಲ್ಲ ಟೈಮ್ ಕಿಲ್ಲರ್ ಕೂಡ. ಇದನ್ನು ಪ್ರಮುಖವಾಗಿ ವಯಸ್ಸಾದವರು ನೋಡುತ್ತಾರೆ. ಆದರೆ ಕೆಲಸಕ್ಕೆ ಬಂದ ಕೆಲಸದವರು ತಾವು ಕೆಲಸ ಮಾಡುವ ಬದಲು ಮನೆಯಲ್ಲಿ ಕುಳಿತು ನೋಡುತ್ತಾರೆ.
true
false
ನನ್ನ ಪ್ರಕಾರ ಹೇಳಬೇಕಾದರೆ ಡಾಕ್ಷರ್‌ಗಳಿಗೆ ದುಡ್ಡು ಕೊಡುವ ಮಹಾಲಕ್ಷ್ಮಿ ಈ ಟಿ.ವಿ.
false
c:\kannada.txt
String objects actually have a fixed UTF-16 encoding.
A byte[] technically has no encoding, but you can attach an encoding to a byte[]. So if you need UTF-8 encoded data, you need a byte[].
So my approach would be
byte[] utf8 = myString.getBytes("UTF-8");
to get a UTF-8 byte[].
IMHO, though, finding a UTF-8 encoded substring in a string (which is entirely UTF-16!) is senseless :)
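For illustration, a round trip between a UTF-16 String and UTF-8 bytes:

import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        String text = "\u0109u ne?";                             // Strings are UTF-16 internally
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);     // encode to UTF-8 bytes
        String back = new String(utf8, StandardCharsets.UTF_8);  // decode back
        System.out.println(text.equals(back));                   // prints true
    }
}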
Thank you all for your help. Now I'm able to find the substring. It worked when I put the word on the next line in the word.txt file and read it with a second readLine() statement.
