- Mastering Java for Data Science
- Alexey Grigorev
- 2021-07-02 23:44:34
Reading input data
Being able to read data is one of the most important skills for a data scientist, and this data is usually in text format, be it TXT, CSV, or any other format. In the Java I/O API, the subclasses of the Reader class deal with reading text files.
Suppose we have a text.txt file with some sentences (which may or may not make sense):
- My dog also likes eating sausage
- The motor accepts beside a surplus
- Every capable slash succeeds with a worldwide blame
- The continued task coughs around the guilty kiss
If you need to read the whole file as a list of strings, the usual Java I/O way of doing this is to use a BufferedReader:
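import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

List<String> lines = new ArrayList<>();
try (InputStream is = new FileInputStream("data/text.txt")) {
    try (InputStreamReader isReader = new InputStreamReader(is,
            StandardCharsets.UTF_8)) {
        try (BufferedReader reader = new BufferedReader(isReader)) {
            while (true) {
                String line = reader.readLine();
                if (line == null) {
                    break; // readLine() returns null at the end of the file
                }
                lines.add(line);
            }
            // no explicit close() is needed: try-with-resources closes each resource
        }
    }
}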
It is important to provide the character encoding: this way, the Reader knows how to translate the sequence of bytes into a proper String object. Apart from UTF-8, there are UTF-16, ISO-8859-1 (an ASCII-based encoding that covers Western European languages), and many others.
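To see why the encoding matters, the following minimal sketch decodes the same bytes with two different charsets. The class name EncodingDemo, the helper decode, and the sample word are assumptions made for illustration; only the StandardCharsets constants come from the text above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: decodes raw bytes into a String using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "café" encoded as UTF-8: the "é" takes two bytes (0xC3 0xA9).
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(bytes.length);             // 5 bytes for 4 characters
        System.out.println(asUtf8.length());          // 4
        System.out.println(asLatin1.length());        // 5: every byte became its own character
        System.out.println(asUtf8.equals(asLatin1));  // false
    }
}
```

Decoding with the wrong charset does not fail; it silently produces a different string, which is why it is safer to state the encoding explicitly than to rely on the platform default.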
There is a shortcut to get a BufferedReader for a file directly:
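import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path path = Paths.get("data/text.txt");
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    // read line-by-line
}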
Even with this shortcut, the code is quite verbose for such a simple task as reading a list of lines from a file. You can wrap it in a helper function, or use the Java NIO API, which provides helper methods that make this task easier:
Path path = Paths.get("data/text.txt");
List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
System.out.println(lines);
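The helper function mentioned earlier could look roughly like the following sketch. The class name LineReaderDemo and the method readLines are hypothetical names chosen for illustration, and the sketch writes a temporary file so that it does not depend on data/text.txt being present.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineReaderDemo {
    // Hypothetical helper: reads a UTF-8 text file into a list of lines.
    static List<String> readLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Write two of the sample sentences to a temporary file.
        Path tmp = Files.createTempFile("text", ".txt");
        Files.write(tmp, Arrays.asList(
                "My dog also likes eating sausage",
                "The motor accepts beside a surplus"), StandardCharsets.UTF_8);

        List<String> lines = readLines(tmp);
        System.out.println(lines.size());
        // The helper and Files.readAllLines should return the same list.
        System.out.println(lines.equals(Files.readAllLines(tmp, StandardCharsets.UTF_8)));

        Files.delete(tmp);
    }
}
```

For plain files the helper adds nothing over Files.readAllLines; it is mainly useful when each line needs some processing as it is read.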
The Java NIO shortcuts work only for files. Later, we will talk about shortcuts that work for any InputStream objects, not just files.
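Until then, one way to read lines from an arbitrary InputStream with only the JDK is sketched below. This is not necessarily the approach the later sections take; the class name StreamLinesDemo and the method readLines are hypothetical, and a ByteArrayInputStream stands in for a real source such as a network or archive stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class StreamLinesDemo {
    // Hypothetical helper: reads all lines from any InputStream and closes it afterwards.
    static List<String> readLines(InputStream is, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(is, charset))) {
            // BufferedReader.lines() (Java 8+) exposes the lines as a Stream<String>.
            return reader.lines().collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8);
        List<String> lines = readLines(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        System.out.println(lines); // [first line, second line]
    }
}
```

Note that the helper closes the stream it is given; a caller that needs to keep the stream open would have to manage the reader itself.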