So, why the command line?

As the field of data science is still fairly new (it used to be called operations research), its tools and frameworks are also fairly new. The command line, by contrast, is almost 50 years old and is still one of the most powerful tools in use today. If you're familiar with interpreters, the command line will come easily to you. Think of it as a place to experiment and see your results in real time. Every command you enter is executed interactively, and when you run a bash script, its commands execute sequentially (unless you decide otherwise; more on that in later chapters). As we know, experimenting and exploring are most of what data science tries to accomplish (and they're the most fun!).

I was having a conversation with a newly graduated data science student about parsing text and asked, "How would you take a small file and count how many times each word appears?" By now, everyone is familiar with the infamous Hadoop word-count example. It's considered the "Hello, World" of data science.

The answer I received was a little shocking but expected. The student instantly replied that they'd use Hadoop to read the file, tokenize the words to form key/value pairs, group the pairs by key, and reduce each group by adding up the occurrences. The student isn't wrong. In fact, that's a perfectly acceptable answer, especially if the file is too large for a single system (big data), because you already have the code in place to scale.

With that being said, what if I told you there's a quicker way to obtain the results that doesn't require programming in Java and setting up a cluster or running Hadoop locally? In fact, it takes only one line to complete the task. Check out the following code (the second line is an equivalent form that reads the file through input redirection instead of cat):

cat file.txt | tr '[:space:]' '[\n*]' | grep -v "^$" | sort | uniq -c | sort -bnr
(tr '[:space:]' '[\n*]' | grep -v "^$" | sort | uniq -c | sort -bnr )<file.txt

This may seem like a lot, especially if you've never used the command line before, so let's break it down. The cat command reads files sequentially and writes them to standard output. |, also known as the pipe or the pipe operator, chains commands together by their standard streams so that the output of each process (stdout) feeds directly as input (stdin) to the next one. tr (translate) reads the input from cat (via |) and replaces every whitespace character (spaces, tabs, and newlines) with a newline, so that each word ends up on its own line.

The grep command is very powerful and is one of the most used commands for data parsing. grep searches plain-text data for lines that match a regular expression. In this example, the pattern "^$" matches empty lines and the -v option inverts the match, so grep drops the empty lines and keeps everything else. sort is used for, well, sorting! You'll notice a lot of the commands are named for what they actually do. The sort command prints the lines of its input, or of the concatenated files listed in its arguments, in sorted order.

uniq is a command that, when fed text, outputs it with adjacent identical lines collapsed to one. Because it only compares adjacent lines, it is usually run right after sort. In this example, uniq -c prefixes each distinct line with the number of times it occurred. And finally, sort -bnr sorts those counts numerically (-n) in reverse order (-r), ignoring leading blanks (-b), so the most frequent words come first.
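To watch the pipeline work on something small, here is a minimal sketch that wraps the one-liner in a function and runs it on a two-line sample. The function name word_count, the sample text, and the temporary file are assumptions made for illustration; they are not part of the original example.

```shell
#!/usr/bin/env bash

# word_count: hypothetical helper that wraps the one-liner from the text.
# $1 is the path of the file whose words should be counted.
word_count() {
  tr '[:space:]' '[\n*]' < "$1" |  # put every word on its own line
    grep -v "^$" |                 # drop the empty lines left by repeated whitespace
    sort |                         # bring identical words next to each other
    uniq -c |                      # collapse duplicates and prefix each with its count
    sort -bnr                      # highest count first
}

# Hypothetical sample data, written to a temporary file.
sample=$(mktemp)
printf 'the cat saw the dog\nthe dog ran\n' > "$sample"

word_count "$sample"

rm -f "$sample"
```

On this sample, the first line of output reports a count of 3 for "the" and the second a count of 2 for "dog"; the remaining words each appear once. The amount of padding in front of the counts depends on your uniq implementation.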

Don't worry if the example looks foreign to you. The command line comes with manual pages for each command. All you have to do is run man followed by the command name to view its page. You can even run man man to get an idea of what the man command does! Give it a whirl with man tr or man sort. Oh, you don't have the command line set up? It's easier than you think, and we can get you up and running in minutes, so let's get started.
