Extracting specific urls from a text file using java - java

I have a text document containing a bunch of URLs of the form /courses/......./.../..
From these, I only want to extract the URLs of the form /courses/.../lecture-notes, meaning those that begin with /courses and end with /lecture-notes.
Would anyone know of a good way to do this with regular expressions or just by string matching?

Here's one alternative:
Scanner s = new Scanner(new FileReader("filename.txt"));
String str;
while (null != (str = s.findWithinHorizon("/courses/\\S*/lecture-notes", 0)))
System.out.println(str);
Given a filename.txt with the content
Here /courses/lorem/lecture-notes and
here /courses/ipsum/dolor/lecture-notes perhaps.
the above snippet prints
/courses/lorem/lecture-notes
/courses/ipsum/dolor/lecture-notes

The following will return only the middle part (i.e. it excludes /courses/ and /lecture-notes):
Pattern p = Pattern.compile("/courses/(.*)/lecture-notes");
Matcher m = p.matcher(yourString);
if (m.find())
return m.group(1); // The "1" here means it returns the part matched by the first pair of parentheses in the regex.
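If a single string can contain several URLs, the capturing-group idea above can be applied in a loop. The following is a minimal sketch, not the answerer's code: the class and method names are made up for illustration, and it swaps the greedy `.*` for the reluctant `\S*?` so that one match cannot stretch across two URLs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LectureNotePaths {
    // \S*? is reluctant and cannot cross whitespace, so each URL is matched on its own.
    private static final Pattern LECTURE_NOTES =
            Pattern.compile("/courses/(\\S*?)/lecture-notes");

    /** Returns the text between /courses/ and /lecture-notes for every match in the input. */
    static List<String> middleParts(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = LECTURE_NOTES.matcher(text);
        while (m.find()) {
            parts.add(m.group(1)); // group 1 is the first parenthesized part of the regex
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "Here /courses/lorem/lecture-notes and\n"
                + "here /courses/ipsum/dolor/lecture-notes perhaps.";
        System.out.println(middleParts(text)); // prints [lorem, ipsum/dolor]
    }
}
```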

Assuming that you have 1 URL per line, you could use:
BufferedReader br = new BufferedReader(new FileReader("urls.txt"));
String urlLine;
while ((urlLine = br.readLine()) != null) {
if (urlLine.matches("/courses/.*/lecture-notes")) {
// use url
}
}

Related

java Jsoup question how can i split by word?

I want to get html content without tags and the result as
word
word
word
So I tried the following.
public class PreProcessing {
public static void main(String[] args) throws Exception {
PrintWriter out = new PrintWriter("filename.txt");
URL url = new URL("https://en.wikipedia.org/wiki/Distributed_computing");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine = "";
String input = "";
while ((inputLine = in.readLine()) != null)
{
input += inputLine;
// System.out.println(inputLine);
}
//create Jsoup document from HTML
Document jsoupDoc = Jsoup.parse(input);
//set pretty print to false, so \n is not removed
jsoupDoc.outputSettings(new OutputSettings().prettyPrint(false));
//select all <br> tags and append \n after that
// jsoupDoc.select("br").after("\\n");
//select all <p> tags and prepend \n before that
// jsoupDoc.select("p").before("\\n");
//get the HTML from the document, and retaining original new lines
String str = jsoupDoc.html().replaceAll(" ", "\n");
// str.replaceAll("\t", "");
String strWithNewLines = Jsoup.clean(str, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
strWithNewLines.replaceAll("\t", "\n");
strWithNewLines.replaceAll("\"", "");
strWithNewLines.replaceAll(".", "");
System.out.println(strWithNewLines);
out.print(strWithNewLines);
}
}
This is my code. I fetched the English Wikipedia page on distributed computing, read it with a BufferedReader, and parsed it with Jsoup. I replace " " with "\n" because I want the output to be word\n word\n word\n, one word per line.
The result is
Distributed
computing
-
Wikipedia Distributed
computing From
Wikipedia,
the
free
encyclopedia Jump
to
navigation Jump
to
search "Distributed
application"
redirects
here.
For
trustless
applications,
see
But I want the result to look like this
Distributed
computing
-
Wikipedia
Distributed
computing
From
Wikipedia
the
free
encyclopedia
Jump
to
navigation
Jump
to
search
Distributed
application
redirects
here
For
trustless
applications
see
I tried this:
strWithNewLines.replaceAll("\"", "");
strWithNewLines.replaceAll(".", "");
But it did not work. Why didn't it work? I searched on Google but couldn't find a solution.
Try this for the last few lines. This will bring you nearer to your desired result:
String strWithNewLines = Jsoup.clean ...;
String result = strWithNewLines.replaceAll("\t", "\n")
.replaceAll("\"", "");
//.replaceAll(".", "");
System.out.println(result);
The problem in your code is that String is immutable, so String.replaceAll changes nothing in the original String; it produces a new one in which the substitution has been done. But you never use the result.
There is also a problem with .replaceAll(".", ""). The first argument is a regular expression, and an unescaped . matches any character (line terminators excepted), so every such character is replaced by an empty string and practically nothing is left. To remove literal periods, escape the dot, as in .replaceAll("\\.", "").
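Both points can be seen in a small, self-contained sketch. The class and method names are hypothetical, and it assumes the goal from the question: one word per line with double quotes, commas and periods removed. It uses `String.replace`, which takes literal text, for the punctuation, and keeps `replaceAll` only where a regex is actually wanted.

```java
class CleanWords {
    /** Hypothetical helper: one word per line, with double quotes, commas and periods removed. */
    static String clean(String text) {
        // Every call returns a new String, so the calls are chained and the final result is returned.
        return text
                .replaceAll("\\s+", "\n") // regex: each run of whitespace becomes a single newline
                .replace("\"", "")        // replace() treats its argument as literal text
                .replace(",", "")
                .replace(".", "");        // literal dot; replaceAll("\\.", "") would do the same
    }

    public static void main(String[] args) {
        System.out.println(clean("Wikipedia, the free \"Distributed application\" redirects here."));
    }
}
```

Note that `\s` does not match non-breaking spaces by default, so text copied from HTML may need extra handling that this sketch does not attempt.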

Java String Matching in a Sorted File and grouping similar data

I have a sorted file and need to do the following pattern match. I read a row and compare it (or pattern-match it) with the row just after it; if it matches, I append the string I used for matching to that row, after a comma, and move on to the next row. I am new to Java and overwhelmed with options, from OpenCSV to BufferedReader. I intend to iterate through the file until it reaches the end. I may always have blanks and have a date in quotes. The file size would be around 100 MB.
My file has data like
ABCD
ABCD123
ABCD456, 123
XYZ
XYZ890
XYZ123, 890
and output is expected as
ABCD, ABCD
ABCD123, ABCD
ABCD456, 123, ABCD
XYZ, XYZ
XYZ890, XYZ
XYZ123, 890, XYZ
Not sure about the best method. Can you please help me.
To open a file, you can use File and FileReader classes:
File csvFile = new File("file.csv");
FileReader fileReader = null;
try {
fileReader = new FileReader(csvFile);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
You can get a line of the file using Scanner:
Scanner reader = new Scanner(fileReader);
while (reader.hasNextLine()) {
String line = reader.nextLine();
parseLine(line);
}
You want to parse this line. To do that, you need to learn regular expressions and use the Pattern and Matcher classes:
private void parseLine(String line) {
Matcher matcher = Pattern.compile("(ABCD)").matcher(line);
if(matcher.find()){
System.out.println("find: " + matcher.group());
}
}
To find the next match in the same row, you can call matcher.find() again. If another match is found, it returns true and you can get the result with matcher.group().
Read line by line and use regex to replace it as per your need using String.replaceAll()
^([A-Z]+)([0-9]*)(, [0-9]+)?$
Replacement : $1$2$3, $1
Here is Online demo
Read more about Java Pattern
Sample code:
String regex = "^([A-Z]+)([0-9]*)(, [0-9]+)?$";
String replacement = "$1$2$3, $1";
String newLine = line.replaceAll(regex,replacement);
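As a quick check, the regex and replacement above can be run against the sample rows from the question. This is only a sketch: the class and method names are invented for illustration, and it assumes every row consists of uppercase letters, optional digits, and an optional ", digits" suffix, as in the sample data.

```java
class AppendPrefix {
    private static final String REGEX = "^([A-Z]+)([0-9]*)(, [0-9]+)?$";
    private static final String REPLACEMENT = "$1$2$3, $1";

    /** Appends the leading letters of the row to its end; rows that do not match are returned unchanged. */
    static String transform(String line) {
        // Group 3 is optional; when it does not participate in the match, $3 expands to nothing.
        return line.replaceAll(REGEX, REPLACEMENT);
    }

    public static void main(String[] args) {
        String[] rows = {"ABCD", "ABCD123", "ABCD456, 123", "XYZ", "XYZ890", "XYZ123, 890"};
        for (String row : rows) {
            System.out.println(transform(row));
        }
    }
}
```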
For better performance, read 100 or more lines at a time into a buffer, then call String#replaceAll() once to replace them all in a single pass.
Sample code:
String regex = "([A-Z]+)([0-9]*)(, [0-9]+)?(\r?\n|$)";
String replacement = "$1$2$3, $1$4";
StringBuilder builder = new StringBuilder();
int counter = 0;
String line = null;
try (BufferedReader reader = new BufferedReader(new FileReader("abc.csv"))) {
while ((line = reader.readLine()) != null) {
builder.append(line).append(System.lineSeparator());
if (++counter % 100 == 0) { // flush after every 100 lines
String newLine = builder.toString().replaceAll(regex, replacement);
System.out.print(newLine);
builder.setLength(0); // reset the buffer
}
}
}
if (builder.length() > 0) {
String newLine = builder.toString().replaceAll(regex, replacement);
System.out.print(newLine);
}
Read more about Java 7 - The try-with-resources Statement

Find a string in a very large formatted text file in java

Here is the thing:
I have a really big text file and it has a format like this:
0007476|000011434982|00249626000|R|2008-01-11 00:00:00|9999-12-31 23:59:59|000019.99
0007476|000014017887|00313865000|R|2011-04-19 00:00:00|9999-12-31 23:59:59|000599.99
...
...
And I need to find if a particular pattern exists in the file, say
0007476|whatever|00313865000|whatever
All I need is a boolean saying yes or no.
Now what I have done is to read the file line by line and do a regular expression matching:
Pattern pattern = Pattern.compile(regex);
Scanner scanner = new Scanner(new File(fileName));
String line;
while (scanner.hasNextLine()) {
line = scanner.nextLine();
if (pattern.matcher(line).matches()) {
scanner.close();
return true;
}
}
and the regex has a form of
"0007476\|\d{12}\|0031386500.*"
This method works, but it usually takes about 15 seconds to find a string that is far from the start of the file. Is there a faster way to achieve that? Thanks
The java String class has a contains method which returns a boolean. If your string is fixed, this is a lot faster than a regular expression:
if (string.contains("0007476|") && string.contains("|00313865000|")) {
// whatever
}
Hope that helped, if not, leave a comment.
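One caveat with two independent contains checks is that they do not verify which field each value sits in. A position-aware variant, still without a regex, might look like the sketch below. The class name, method name and parameters are hypothetical, and it assumes the first and third pipe-separated fields are the ones being matched.

```java
class RecordMatch {
    /** Returns true if the first field equals id and the third field equals code. */
    static boolean matches(String line, String id, String code) {
        // The line must start with the first field followed by a pipe.
        if (!line.startsWith(id + "|")) {
            return false;
        }
        // Skip over the second field, whatever its content is.
        int secondBar = line.indexOf('|', id.length() + 1);
        // The third field must start right after the second pipe and end with a pipe.
        return secondBar >= 0 && line.startsWith(code + "|", secondBar + 1);
    }

    public static void main(String[] args) {
        String line = "0007476|000014017887|00313865000|R|2011-04-19 00:00:00|9999-12-31 23:59:59|000599.99";
        System.out.println(matches(line, "0007476", "00313865000"));
    }
}
```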
I assume that you need the Scanner because the file is too big to read into a single String?
If that is not the case, you can probably use a regular expression that finds the match directly. Depending on whether or not you care about the specific text at the start of the line, you can use something along the lines of:
"(?m)^0007476\|\d{12}\|0031386500.*$"
If you do need to break it up into smaller chunks because of memory usage, I would suggest not reading line by line (since the lines are rather short) but processing bigger chunks using something like a BufferedReader instead.
I fiddled around a bit with a 1.25GB file and the following is about 2.5 times faster than your implementation:
private static boolean matches() throws IOException {
String regex = "(?m)^0007476\\|\\d{12}\\|0031386500.*$"; // backslashes are doubled inside a Java string literal
Pattern pattern = Pattern.compile(regex);
try(BufferedReader br = new BufferedReader(new FileReader(FILENAME))) {
for(String lines; (lines = readLines(br, 10000)) != null; ) {
if (pattern.matcher(lines).find()) {
return true;
}
}
}
return false;
}
private static String readLines(BufferedReader br, int amount) throws IOException {
StringBuilder builder = new StringBuilder();
int lineCounter = 0;
for(String line; lineCounter < amount && (line = br.readLine()) != null; lineCounter++ ) { // check the count first so no line is read and then dropped
builder.append(line).append(System.lineSeparator());
}
return lineCounter > 0 ? builder.toString() : null;
}

Regular expression illegal character in Java

I've been looking through the Internet and, after a big headache, cannot find why this regular expression is wrong:
"\"\w*&&[\p{Punct}]\"["+sepChar+"]\"\w*&&[\p{Punct}]\""
I'm trying to read a master data file with the following pattern (quotes included):
"TEXTVALUE":"TEXTVALUE":"TEXTVALUE"
and split each line with the regular expression above.
So, for example:
"Hello:John":"Hello:World":"Hello:Mark"
will be split into:
{"Hello:John", "Hello:World", "Hello:Mark"}
The backslash is the escape character in Java string literals. You need to use two backslashes \\ to include a single backslash in the regex.
Try:
"\"\\w*&&[\\p{Punct}]\"["+sepChar+"]\"\\w*&&[\\p{Punct}]\""
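For the split itself, a different and simpler technique than the regex above is to split on a separator only when it sits between a closing and an opening quote, using lookbehind and lookahead. This is a sketch under the assumption that a value never contains the sequence quote, separator, quote; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class QuotedSplit {
    /** Splits on sepChar only where it is preceded by a quote and followed by a quote. */
    static String[] splitQuoted(String line, String sepChar) {
        // Pattern.quote makes the separator literal even if it is a regex metacharacter such as | or .
        String regex = "(?<=\")" + Pattern.quote(sepChar) + "(?=\")";
        return line.split(regex);
    }

    public static void main(String[] args) {
        String line = "\"Hello:John\":\"Hello:World\":\"Hello:Mark\"";
        for (String part : splitQuoted(line, ":")) {
            System.out.println(part); // each part keeps its surrounding quotes
        }
    }
}
```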
Ok.
Thanks to #kevin-bowersox for the help.
It seems that Oracle has done a great job improving Java with version 7.
With this code:
File file = new File(someFile);
BufferedReader br = new BufferedReader(new FileReader(file)); // BufferedReader wraps a Reader, not a File
String line = null;
while((line = br.readLine()) != null){
//todo
}
If your file has been formatted with a constant pattern, for example:
"TEXTVALUE":"TEXTVALUE":"TEXTVALUE"
It reads:
"TEXTVALUE-->TEXTVALUE-->TEXTVALUE"
where '-->' stands for tabs ('\t')
So, at the end, my solution is:
public ArrayList<String[]> getSplittedTextFromFile(String filePath) throws FileNotFoundException, IOException {
ArrayList<String[]> ret = null;
if (!filePath.isEmpty()) {
File input = new File(filePath);
// try-with-resources closes the reader when the block exits
try (BufferedReader br = new BufferedReader(new FileReader(input))) {
String line = null;
while ((line = br.readLine()) != null) {
String[] aSplit = line.split("\\t");
if (ret == null)
ret = new ArrayList<>();
ret.add(aSplit);
}//while
}//try
}//fi
return ret; // null if the path is empty or the file has no lines
}//fnc

Java replace content in a link

I need to read the HTML of a web page, find the links and images, and then rename them. This is what I have done so far:
reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), 'UTF-8'));
String line;
while ((line = reader.readLine()) != null) {
regex = "<a[^>]*href=(\"([^\"]*)\"|\'([^\']*)\'|([^\\s>]*))[^>]*>(.*?)</a>";
final Pattern pa = Pattern.compile(regex, Pattern.DOTALL);
final Matcher ma = pa.matcher(s);
if(ma.find()){
string newlink=path+"1-2.html";
//replace the link in href with newlink, how can i do this?
}
html.append(line).append("/r/n");
}
How can I do the part marked by the comment?
Using regex for parsing HTML can be difficult and unreliable. It's better to use XPath and DOM manipulation for things like that.
Alternatives were mentioned; nevertheless:
Matcher has support for doing a "replace all" using a StringBuffer.
Part of the matched text must be re-added in the replacement text, hence all of it must be captured in groups: ma.group(1) (2, 3, ...).
DOTALL would let . match newline characters; it is not needed here because readLine strips the line end.
There could be more than one link per line.
You had matcher(s) instead of matcher(line) in the example code.
So the code uses Matcher.appendReplacement and appendTail.
StringBuffer html = new StringBuffer();
reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), "UTF-8"));
String line;
regex = "(<a[^>]*href=)(\"([^\"]*)\"|\'([^\']*)\'|([^\\s>]*))[^>]*>(.*?)(</a>)";
final Pattern pa = Pattern.compile(regex);
while ((line = reader.readLine()) != null) {
final Matcher ma = pa.matcher(line);
while (ma.find()) {
String newlink = path + "1-2.html";
ma.appendReplacement(html, ma.group(1) /* a href */ + ...);
}
ma.appendTail(html); // appends the rest of the line after the last match
html.append("\r\n"); // the line content was already appended above, so only the line end is added
}
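To make the appendReplacement/appendTail pattern concrete, here is one possible self-contained version that works on a single line. It is a sketch, not the answerer's exact code: the class name, the method name and the simplified three-group regex are assumptions made for illustration, and the new link is always written in double quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkRewriter {
    // Group 1: the tag up to and including href=
    // Group 2: the old URL, quoted or bare
    // Group 3: the rest of the opening tag, the link text and the closing tag
    private static final Pattern LINK = Pattern.compile(
            "(<a[^>]*href=)(\"[^\"]*\"|'[^']*'|[^\\s>]*)([^>]*>.*?</a>)");

    /** Replaces the href value of every link in the line with newLink. */
    static String rewrite(String line, String newLink) {
        Matcher ma = LINK.matcher(line);
        StringBuffer out = new StringBuffer();
        while (ma.find()) {
            String replacement = ma.group(1) + "\"" + newLink + "\"" + ma.group(3);
            // quoteReplacement keeps $ and \ in the replacement from being read as group references
            ma.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        ma.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("See <a href=\"old.html\">this</a> page.", "1-2.html"));
    }
}
```

As the answer notes, this only holds up for simple, single-line anchors; an HTML parser is the more robust choice for real pages.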
