$ head -n 2 gameviews.tsv
marketplace customer_id review_id product_id product_parent product_title product_category star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date
US 21269168 RSH1OZ87OYK92 B013PURRZW 603406193 Madden NFL 16 - Xbox One Digital Code Digital_Video_Games 2 2 3 N N A slight improvement from last year. I keep buying madden every year hoping they get back to football. This years version is a little better than last years -- but that's not saying much.The game looks great. The only thing wrong with the animation, is the way the players are always tripping on each other.<br /><br />The gameplay is still slowed down by the bloated pre-play controls. What used to take two buttons is now a giant PITA to get done before an opponent snaps the ball or the play clock runs out.<br /><br />The turbo button is back, but the player movement is still slow and awkward. If you liked last years version, I' m guessing you'll like this too. I haven' t had a chance to play anything other than training and a few online games, so I'm crossing my fingers and hoping the rest is better.<br /><br />The one thing I can recommend is NOT TO BUY THE MADDEN BUNDLE. The game comes as a download. So if you hate it, there' s no trading it in at Gamestop. 2015-08-31
第一行是表头,代表下面每一行中每一列的含义:
DATA COLUMNS:
01 marketplace - 2 letter country code of the marketplace where the review was written.
02 customer_id - Random identifier that can be used to aggregate reviews written by a single author.
03 review_id - The unique ID of the review.
04 product_id - The unique Product ID the review pertains to.
05 product_parent - Random identifier that can be used to aggregate reviews for the same product.
06 product_title - Title of the product.
07 product_category - Broad product category that can be used to group reviews
08 star_rating - The 1-5 star rating of the review.
09 helpful_votes - Number of helpful votes.
10 total_votes - Number of total votes the review received.
11 vine - Review was written as part of the Vine program.
12 verified_purchase - The review is on a verified purchase.
13 review_headline - The title of the review.
14 review_body - The review text.
15 review_date - The date the review was written.
回顾之前的 print 第几列:可以这样试一试:
$ awk '{print $1}' gamereviews.tsv | head
marketplace
US
US
US
US
US
US
US
US
US
$ awk '{print $2}' gamereviews.tsv | head
customer_id
21269168
133437
45765011
113118
22151364
22151364
38426028
6057518
20715661
$ awk '{print $3}' gamereviews.tsv | head
review_id
RSH1OZ87OYK92
R1WFOQ3N9BO65I
R3YOOS71KM5M9
R3R14UATT3OUFU
RV2W9SGDNQA2C
R3CFKLIZ0I2KOB
R1LRYU1V0T3O38
R44QKV6FE5CJ2
R2TX1KLPXXXNYS
打到第6列 标题 的时候,结果开始不对劲了:
$ awk '{print $6}' gamereviews.tsv | head
product_title
Madden
Xbox
Command
Playstation
Saints
Double
Sims
Playstation
Playstation
至少上面看过,第二行,也就是第一条 review,title 应该是 Madden NFL 16 - Xbox One Digital Code
,而这里只切割出来的第一个单词。
原因就是, awk
默认以
,一个空格作为区分一行中不同 fileds
的分隔符 Field Separators
。
使用参数 -F '{fields separators}'
来覆盖默认的空格分隔符,这个数据集里的分隔符都是一个 tab
,所以上面的命令可以改为:
$ awk -F '\t' '{print $6}' gamereviews.tsv | head
product_title
Madden NFL 16 - Xbox One Digital Code
Xbox Live Gift Card
Command & Conquer The Ultimate Collection [Instant Access]
Playstation Plus Subscription
Saints Row IV - Enter The Dominatrix [Online Game Code]
Double Dragon: Neon [Online Game Code]
Sims 4
Playstation Network Card
Playstation Network Card
Noted: By default, awk does more than split the input on spaces. It splits based on one or more sequence of space or tab or newline characters. In addition, any of these three characters at the start or end of input gets trimmed and won’t be part of field contents. Newline characters come into play if the record separator results in newline within the record content.
再回顾一下之前的 NF
和 NR
,获取一下最后一行的日期(15 review date)、第8个评分(08 star rating)以及倒数第三个(13 review_headline):
$ awk -F '\t' '{print NR,$NF,$8,$(NF-2) }' gamereviews.tsv | head -n 10
1 review_date star_rating review_headline
2 2015-08-31 2 A slight improvement from last year.
3 2015-08-31 5 Five Stars
4 2015-08-31 5 Hail to the great Yuri!
5 2015-08-31 5 Five Stars
6 2015-08-31 5 Five Stars
7 2015-08-31 5 Five Stars
8 2015-08-31 4 i like the new skills like herbalism in this
9 2015-08-31 5 Five Stars
10 2015-08-31 5 Easy & Fast
至此都是在对每一行进行操作, awk
实际包含模式匹配,之前提过, awk
的语法实际上是一条条的 pattern { action }
,目前我们也只用上了 {print}
这个 action
。
关于模式匹配,一个简单的例子:
$ echo "aa 1
bb 2
cc 3" | awk -F ' ' '/bb/ {print $2}'
2
可以看到,对于 awk 扫描的三行数据,只有匹配到了 bb
的第二行,运行了 print
的 action
。这里的 /bb/
就是 pattern
。
回到之前的数据集,比如我想匹配 Minecraft
这个游戏的 reviews,看看大家的评分,这么写:
$ awk -F '\t' '/Minecraft/ {print $6, $8}' gamereviews.tsv | head
Minecraft for PC/Mac [Online Game Code] 4
Minecraft for PC/Mac [Online Game Code] 5
Xbox 360 Live Points Card 5
Minecraft for PC/Mac [Online Game Code] 2
Minecraft for PC/Mac [Online Game Code] 5
Minecraft - Xbox 360 5
Minecraft for PC/Mac [Online Game Code] 1
Nom Nom Galaxy - PS4 [Digital Code] 4
Minecraft - Xbox One Digital Code 5
Minecraft for PC/Mac [Online Game Code] 4
确实过滤出来了 Minecraft
,但是同时,任何字段里面带有 Minecraft
的 reviews 都被统计进来了,包括任何提及它的别的游戏的 reviews:
$ awk -F '\t' '/Minecraft/ {print $6, $8}' gamereviews.tsv | sort | uniq
8BitMMO [Game Connect] 2
Ace of Spades: Battle Builder [Online Game Code] 5
Agricultural Simulator Historical Farming [Download] 1
Bioschock Infinite: Clash in the Clouds 5
Blini Kids: Animals [Download] 5
Blockland 4
Blockland 5
Blockstorm [Download] 4
Borderlands 2 5
Call of Duty 4: Modern Warfare [Download] 5
所以还需要针对某个字段进行匹配: pattern
可以写成 $6 ~ /Minecraft/
来模糊匹配 06 product_title
里面有 Minecraft
的
$ awk -F '\t' ' $6 ~ /Minecraft/ {print $4,$6,$8} ' gamereviews.tsv | head
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 4
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 5
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 2
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 5
B010BWCOWI Minecraft - Xbox 360 5
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 1
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 5
B00NMO0IA8 Minecraft - Xbox One Digital Code 5
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 4
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 4
或者根据我们确认到的 04 product_id
值 B010KYDNDG
来准确匹配所有的 Minecraft for PC/Mac [Online Game Code]
,使用 $4 == "B010KYDNDG"
:
$ awk -F '\t' '$4 == "B010KYDNDG" {print $6, $8} ' gamereviews.tsv | head
Minecraft for PC/Mac [Online Game Code] 4
Minecraft for PC/Mac [Online Game Code] 5
Minecraft for PC/Mac [Online Game Code] 2
Minecraft for PC/Mac [Online Game Code] 5
Minecraft for PC/Mac [Online Game Code] 1
Minecraft for PC/Mac [Online Game Code] 5
Minecraft for PC/Mac [Online Game Code] 4
Minecraft for PC/Mac [Online Game Code] 4
Minecraft for PC/Mac [Online Game Code] 5
Minecraft for PC/Mac [Online Game Code] 5
已经精确匹配了 $4 == "B010KYDNDG"
,修改一下 print 的字段,依次打印 15:review_date, 13:review_headline, 08:star_rating
$ awk -F '\t' '$4 == "B010KYDNDG" {print $15, $13, $8} ' gamereviews.tsv | head
2015-08-31 FUN 4
2015-08-31 ... have disks for games as a backup it was nice to be able to get the code then start ... 5
2015-08-30 Would rather a disk. There always seems to be ... 2
2015-08-29 Five Stars 5
2015-08-27 Very Disappointed Dad and Birthday Boy 1
2015-08-26 Five Stars 5
2015-08-24 Four Stars 4
2015-08-24 My son says it is fun. He likes it but we were a little ... 4
2015-08-24 MINECWAFT 5
2015-08-22 MINECRAFT!!! 5
总结一下:
awk
的语法 pattern { action }
。
模糊匹配使用两个 /
包括来: /regexp/
针对字段模糊匹配: $n ~ /regexp/
, ~
表示匹配上, !~
表示不匹配
针对字段精确匹配: $n == "value"
上面已经可以按照我们期望的条件,匹配我们需要的内容并打印出来, print
这个 action
后面接的参数,如果用 ,
隔开,就是按照默认的 输出域分隔符(Output Field Separators)
来分割每个 fields
,这个分隔符在 awk 内部用 OFS
表示,还有 ORS
表示 输出记录分隔符(Output Record Separators)
。记单词就很容易记下来: Field - 域(一个字段)
, Record - 记录(一行内容)
。
> 其实之前的 -F
参数,也是修改的 FS
。对应的输入记录分隔符为 RS
,默认为一个换行符
上面的命令,可以通过加上修改参数 OFS
和 ORS
来让结果更美观:
$ awk '{FS="\t"; OFS=" : "; ORS= "\n\n"} $4 == "B010KYDNDG" {print $15, $13, $8} ' gamereviews.tsv | head
2015-08-31 : FUN : 4
2015-08-31 : ... have disks for games as a backup it was nice to be able to get the code then start ... : 5
2015-08-30 : Would rather a disk. There always seems to be ... : 2
2015-08-29 : Five Stars : 5
2015-08-27 : Very Disappointed Dad and Birthday Boy : 1
除了设置输出分隔符,也可以让输出内容按照我们期望的格式直接输出, awk
的 printf
和 C/Shell 的格式很像:
%c
: Print a number as a character
%d,%i
: Print a decimal integer
%f
: Print a number in floating-point notation
%s
: Print a string.
%x
: 16 进制
%o
: 8进制
%u
: unsigned
更多格式控制符见 Control-Letters
用引号括住所有格式,后面的参数里按顺序填入对应的变量:
$ awk -F '\t' ' $4 == "B010KYDNDG" {printf "%s :%s %s\n", $15, $13, $8} ' gamereviews.tsv | head
2015-08-31 :FUN 4
2015-08-31 :... have disks for games as a backup it was nice to be able to get the code then start ... 5
2015-08-30 :Would rather a disk. There always seems to be ... 2
2015-08-29 :Five Stars 5
2015-08-27 :Very Disappointed Dad and Birthday Boy 1
2015-08-26 :Five Stars 5
2015-08-24 :Four Stars 4
2015-08-24 :My son says it is fun. He likes it but we were a little ... 4
2015-08-24 :MINECWAFT 5
2015-08-22 :MINECRAFT!!! 5
width
宽度修饰词: printf "%4s", "foo"
=> ·foo
// ·
表示空格
-
(Minus),用在宽度修饰词前面,变成左对齐。(否则默认是右对齐): printf "%-4s", "foo"
=> foo·
.
精度修饰词,根据后面不同的格式控制符,含义稍有不同: printf "%.4s", "foobar"
=> foob
精度修饰词可以叠加再其它修饰词后面使用,例如控制内容长度为25,宽度为30且左对齐,可以写作 %-30.25s
:
$ awk -F '\t' ' $4 == "B010KYDNDG" {printf "%s : %-30.25s %s\n", $15, $13, $8} ' gamereviews.tsv | head
2015-08-31 : FUN 4
2015-08-31 : ... have disks for games 5
2015-08-30 : Would rather a disk. Ther 2
2015-08-29 : Five Stars 5
2015-08-27 : Very Disappointed Dad and 1
2015-08-26 : Five Stars 5
2015-08-24 : Four Stars 4
2015-08-24 : My son says it is fun. He 4
2015-08-24 : MINECWAFT 5
2015-08-22 : MINECRAFT!!! 5
注: 原文章里提到了可以使用 build-in
方法 substr
来操作 $13
: sub($13,1,25)
可以达到同样的效果。
其它更多修饰词及用法见 Format-Modifiers
现在以及能够搜索、匹配、按格式打印了,来试着做一点计算。
继续上面的例子,能够得到所有的评分,如果期望计算所有评分的平均值,我们需要一个临时变量来求和, awk
里,变量可以直接声明并使用:
$ awk -F '\t' ' \
$4 == "B010KYDNDG" {total = total + $8; printf "now we have : %d \n", total} \
' gamereviews.tsv | head
now we have : 4
now we have : 9
now we have : 11
now we have : 16
now we have : 17
now we have : 22
now we have : 26
now we have : 30
now we have : 35
now we have : 40
每计算后都打印一次会输出太多无关内容,通过 BEGIN
或者 END
作为 pattern
的语句,可以在其它主要语句的运行前和运行后分别运行一次:
$ awk -F '\t' '\
BEGIN {printf "\nBEGIN:\n"} \
$4 == "B010KYDNDG" {total = total + $8; cnt = cnt + 1} \
END {printf "sum is %d, avg is %.2f\nEND\n", total, total/cnt} \
' gamereviews.tsv
BEGIN:
sum is 131, avg is 3.97
END