- Java Data Science Cookbook
- Rushdi Shams
- 2021-07-09 18:44:26
Parsing Comma Separated Value (CSV) Files using Univocity
Comma Separated Value (CSV) files, in which the values of each record are separated by commas, are another file type that data scientists handle very often. CSV files are popular because most spreadsheet applications, such as MS Excel, can read them.
In this recipe, we will see how we can parse CSV files and handle data points retrieved from them.
Getting ready
In order to perform this recipe, we will require the following:
- Download the Univocity JAR file from http://oss.sonatype.org/content/repositories/releases/com/univocity/univocity-parsers/2.2.1/univocity-parsers-2.2.1.jar. Include the JAR file in your Eclipse project as an external library.
- Create a CSV file from the following data using Notepad. The extension of the file should be .csv. Save the file as C:/testCSV.csv:
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
,,"Venture ""Extended Edition""","",4900.00
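If you would rather not type the file by hand, the same sample data can be written from Java using only the standard library. The following is a minimal sketch, not part of the original recipe: the class name SampleCsvWriter and its methods are hypothetical, and the default target testCSV.csv in the working directory is an assumption (pass C:/testCSV.csv as the first argument to match the recipe).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the recipe or of Univocity.
class SampleCsvWriter {

    // The header line plus the five records of the sample file.
    // Inside a quoted CSV field, a literal double quote is written as "".
    static List<String> sampleLines() {
        return Arrays.asList(
            "Year,Make,Model,Description,Price",
            "1997,Ford,E350,\"ac, abs, moon\",3000.00",
            "1999,Chevy,\"Venture \"\"Extended Edition\"\"\",\"\",4900.00",
            "1996,Jeep,Grand Cherokee,\"MUST SELL! air, moon roof, loaded\",4799.00",
            "1999,Chevy,\"Venture \"\"Extended Edition, Very Large\"\"\",,5000.00",
            ",,\"Venture \"\"Extended Edition\"\"\",\"\",4900.00");
    }

    // Writes the sample lines to the given path, replacing any existing file.
    static Path writeSample(Path target) {
        try {
            return Files.write(target, sampleLines(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed default location; the recipe itself uses C:/testCSV.csv.
        Path target = Paths.get(args.length > 0 ? args[0] : "testCSV.csv");
        System.out.println("Wrote " + writeSample(target).toAbsolutePath());
    }
}
```

Keeping the lines in a separate method makes it easy to compare them against the listing above before the file is written.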
How to do it...
- Create a method named parseCSV(String) that takes the name of the file as a String argument:
public void parseCSV(String fileName){
- Then create a settings object. This object provides many configuration options:
CsvParserSettings parserSettings = new CsvParserSettings();
- You can configure the parser to automatically detect what line separator sequence is in the input:
parserSettings.setLineSeparatorDetectionEnabled(true);
- Create a RowListProcessor that stores each parsed row in a list:
RowListProcessor rowProcessor = new RowListProcessor();
- You can configure the parser to use a RowProcessor to process the values of each parsed row. You will find more RowProcessors in the com.univocity.parsers.common.processor package, but you can also create your own:
parserSettings.setRowProcessor(rowProcessor);
- If the CSV file that you are going to parse contains headers, you can tell the parser to treat the first parsed row as the headers of the columns in the file:
parserSettings.setHeaderExtractionEnabled(true);
- Now, create a parser instance with the given settings:
CsvParser parser = new CsvParser(parserSettings);
- The parse() method will parse the file and delegate each parsed row to the RowProcessor you defined:
parser.parse(new File(fileName));
- If you have parsed the headers, the headers can be found as follows:
String[] headers = rowProcessor.getHeaders();
- You can then easily process this String array to get the header values.
- On the other hand, the row values can be found in a list. The list can be printed using a for loop as follows:
List<String[]> rows = rowProcessor.getRows();
for (int i = 0; i < rows.size(); i++){
    System.out.println(Arrays.asList(rows.get(i)));
}
- Finally, close the method:
}
The entire method can be written as follows:
import java.io.File;
import java.util.Arrays;
import java.util.List;

import com.univocity.parsers.common.processor.RowListProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class TestUnivocity {
    public void parseCSV(String fileName){
        // Configure the parser: detect line separators and read the first row as headers
        CsvParserSettings parserSettings = new CsvParserSettings();
        parserSettings.setLineSeparatorDetectionEnabled(true);
        // Collect every parsed row in a list
        RowListProcessor rowProcessor = new RowListProcessor();
        parserSettings.setRowProcessor(rowProcessor);
        parserSettings.setHeaderExtractionEnabled(true);
        // Parse the file; rows are delegated to rowProcessor
        CsvParser parser = new CsvParser(parserSettings);
        parser.parse(new File(fileName));
        // Column names taken from the first row
        String[] headers = rowProcessor.getHeaders();
        // Print each data row
        List<String[]> rows = rowProcessor.getRows();
        for (int i = 0; i < rows.size(); i++){
            System.out.println(Arrays.asList(rows.get(i)));
        }
    }

    public static void main(String[] args){
        TestUnivocity test = new TestUnivocity();
        test.parseCSV("C:/testCSV.csv");
    }
}
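The recipe stops at printing the rows, so here is a rough sketch of one way to work with the header array and the row list afterwards: look up a column by its header name and add up its values. The class RowSummary and its methods are hypothetical. To keep the example runnable without the Univocity JAR, hard-coded arrays stand in for the results of rowProcessor.getHeaders() and rowProcessor.getRows(). The null entries assume that the parser reports empty fields as null, which is worth checking against your own output.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the recipe or of Univocity.
class RowSummary {

    // Returns the position of a column name in the header array, or -1 if absent.
    static int columnIndex(String[] headers, String name) {
        return Arrays.asList(headers).indexOf(name);
    }

    // Adds up a numeric column, skipping rows where the value is missing.
    static double sumColumn(String[] headers, List<String[]> rows, String name) {
        int col = columnIndex(headers, name);
        if (col < 0) {
            throw new IllegalArgumentException("No such column: " + name);
        }
        double total = 0.0;
        for (String[] row : rows) {
            if (col < row.length && row[col] != null) {
                total += Double.parseDouble(row[col]);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-ins for rowProcessor.getHeaders() and rowProcessor.getRows().
        // Assumption: empty fields are represented as null.
        String[] headers = {"Year", "Make", "Model", "Description", "Price"};
        List<String[]> rows = Arrays.asList(
            new String[] {"1997", "Ford", "E350", "ac, abs, moon", "3000.00"},
            new String[] {"1999", "Chevy", "Venture \"Extended Edition\"", null, "4900.00"},
            new String[] {"1996", "Jeep", "Grand Cherokee", "MUST SELL! air, moon roof, loaded", "4799.00"},
            new String[] {"1999", "Chevy", "Venture \"Extended Edition, Very Large\"", null, "5000.00"},
            new String[] {null, null, "Venture \"Extended Edition\"", null, "4900.00"});

        System.out.println(sumColumn(headers, rows, "Price")); // prints 22599.0
    }
}
```

Resolving the column index from the header name, rather than hard-coding position 4, keeps the code working if the columns in the file are reordered.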
Note
There are many CSV parsers written in Java. In the benchmark linked here, which is maintained by the Univocity project itself, Univocity is reported to be the fastest of the parsers compared. See the detailed comparison results: https://github.com/uniVocity/csv-parsers-comparison