- Hands-On Data Science with the Command Line
- Jason Morris Chris McCubbin Raymond Page
- 684字
- 2021-07-02 13:58:53
So, why the command line?
As the field of data science is still fairly new (it used to be called operations research), the tools and frameworks are also fairly new. With that being said, the command line is almost 50 years old and still one of the most powerful tools used today. If you're familiar with interpreters, the command line will come easy to you. Think of it as a place to experiment and see your results in real time. Every command you enter is executed interactively, and when you call a bash script to run, it executes sequentially (unless you decide not to, more in later chapters). As we know, experimenting and exploring is most of what data science tries to accomplish (and it's the most fun!).
I was having a conversation with a newly-graduated data science student about parsing text and asked, "How would you take a small file and provide a word count on how many time the words appear?" By now everyone is familiar with the infamous Hadoop word-count example. It's considered the "Hello, World" of data science.
The answer I received was a little shocking but expected. The student instantly replied that they'd use Hadoop to read the file, tokenize the words to form a key/value pair, reduce all the keys and values that are grouped together, and add up the occurrences. The student isn't wrong, in fact, that's a perfectly acceptable answer. Especially if the file is too large for a single system (big data), you already have the code in place to scale.
With that being said, what if I told you there's a quicker way to obtain the results that doesn't require programming in Java and setting up a cluster or having Hadoop run locally? In fact, it would only take one line to complete the task? Check out the following code:
cat file.txt | tr '[:space:]' '[\n*]' | grep -v "^$" | sort | uniq -c | sort -bnr
(tr '[:space:]' '[\n*]' | grep -v "^$" | sort | uniq -c | sort -bnr )<file.txt
This may seem like a lot, especially if you've never used the command line before, so let's break it down. The cat command reads files sequentially and writes them to standard output. |, also known as pipe or the pipe operator, combines a sequence of commands chained together by their standard streams so that the output of each process (stdout) feeds directly as input (stdin) to the next one. tr (translate) reads the input from cat (via | ) and writes the result to standard output that replaces spaces with new lines. The grep command is very powerful and the most used for a lot of data parsing. grep is used to search plain-text data for lines that match a regular expression. In this example, grep trims out the empty lines. sort is used for, well, sorting! You'll notice a lot of the commands are named for what they actually do. The sort command prints the lines of its input or concatenation of files listed in its argument list in sorted order. uniq is a command that, when fed a text file, outputs the file with adjacent identical lines collapsed to one. It usually works well with the sort command. In this example, uniq -c is called to count occurrences. And finally, sort -bnr sorts in numeric reverse order and ignores whitespace.
Don't worry if the example looks foreign to you. The command line also comes with manual pages for each command. All you have to do is man the command to view the page. You can even man man to get an idea of what the man command does! Give it a whirl and man tr or man sort. Oh, you don't have the command line set up? It's easier than you think, and we can get you up in running in minutes, so let's get started.
- 現(xiàn)代測控系統(tǒng)典型應(yīng)用實例
- Circos Data Visualization How-to
- R Machine Learning By Example
- 樂高機器人EV3設(shè)計指南:創(chuàng)造者的搭建邏輯
- Ceph:Designing and Implementing Scalable Storage Systems
- 觸控顯示技術(shù)
- 筆記本電腦維修90個精選實例
- 空間機械臂建模、規(guī)劃與控制
- 從零開始學(xué)PHP
- 實用網(wǎng)絡(luò)流量分析技術(shù)
- Hands-On Deep Learning with Go
- PostgreSQL High Performance Cookbook
- 傳感技術(shù)基礎(chǔ)與技能實訓(xùn)
- Mastering Android Game Development with Unity
- 軟件需求最佳實踐