實作九

Hive (1) : 將 CSV 資料匯出到 Hive

下載 CSV 示範資料：魔球

巨量資料的應用已經擴及生活中的許多角落，電影「魔球(MONEYBALL)」就是一個例子。在 Sean Lahman 的網站上公佈了一些關於棒球的 CSV 資料，是採創用 CC 3.0 授權(Creative Commons Attribution-ShareAlike 3.0 Unported License)，因此，我們就先拿這份資料來當作學習如何將 CSV 格式資料匯入 Hive 的範例。

首先，我們下載 2012 年的球賽統計資料，並解壓縮。

user@master ~ $ mkdir baseball
user@master ~ $ cd baseball/
user@master ~/baseball $ wget http://seanlahman.com/files/database/lahman2012-csv.zip
user@master ~/baseball $ unzip lahman2012-csv.zip

接著，我們必須將資料上傳到 HDFS。讓我們在 HDFS 上建立一個名為 baseball 的目錄，並將所有 CSV 檔案上傳到該目錄中。

user@master ~/baseball $ hadoop fs -mkdir baseball
user@master ~/baseball $ hadoop fs -put *.csv baseball
user@master ~/baseball $ hadoop fs -ls baseball
Found 24 items
-rw-------   3 user user     198529 2013-12-23 16:27 baseball/AllstarFull.csv
-rw-------   3 user user    5730747 2013-12-23 16:28 baseball/Appearances.csv
-rw-------   3 user user       7304 2013-12-23 16:28 baseball/AwardsManagers.csv
-rw-------   3 user user     240867 2013-12-23 16:28 baseball/AwardsPlayers.csv
-rw-------   3 user user      16719 2013-12-23 16:28 baseball/AwardsShareManagers.csv
... 略 ...

建立 Hive 資料庫

雖然 CSV 有欄位定義，但 Hive 並無法自動判斷欄位的資料型態，因此我們還是必須根據資料來源的定義，自行建立 Hive 資料表的 Schema。
如果想瞭解這份資料的每個 CSV 檔有哪些欄位，請參閱 readme 2012.txt 這個說明檔案。像是有 Open Source Sports 這樣的網站，而他們每年都會公佈各年度的統計資料。說明文件中也有提到另有 MS Access 的格式，若有需要，也可以下載這個版本，就可以比較輕易地進行 MS Access 先轉 MS SQL Server 再轉到 Hive 資料表的轉換，就可以省卻自己定義 Schema 的步驟。
```
MS Access Versions:
      lahman2012.mdb
      2012readme.txt
```
在 Hive 上匯入 CSV 資料的順序是
- 建立資料庫(database)
- 建立資料表(table)
- 將資料匯入到資料表(import)
- 驗證資料表內容是否正確(verify)
首先，就讓我們來學習如何在 Hive 建立一個資料庫的語法。Hive 跟其他資料庫系統很像，也有一個互動式的 Shell 環境。請在命令列中執行指令 hive 就可以進入 Hive 的互動式查詢介面。
```
user@master ~/baseball $ hive
Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-0.8.1-cdh4.0.1.jar!/hive-log4j.properties
Hive history file=/tmp/user/hive_job_log_user_201312231714_241463960.txt
hive> 
```
後面的步驟若是看到 hive> 開頭的，就代表是在 Hive 的互動式介面中執行的指令。
Hive 的語法，我們稱為 HiveQL，建立資料庫的 HiveQL 語法是 create database <資料庫名稱>。
```
hive> create database baseball_userXX;
OK
Time taken: 5.983 seconds
```

建立 Hive 資料表

接著，讓我們選定某一個想要分析的 CSV 資料，例如：Master.csv，先瞭解 Master 這個檔案的欄位跟資料型態。說明文件中提到

  MASTER - Player names, DOB, and biographical info
---
2.1 MASTER table
---
lahmanID       Unique number assigned to each player
playerID       A unique code asssigned to each player.  The playerID links
               the data in this file with records in the other files.
managerID      An ID for individuals who served as managers
hofID          An ID for individuals who are in teh baseball Hall of Fame
birthYear      Year player was born
birthMonth     Month player was born
birthDay       Day player was born
birthCountry   Country where player was born
birthState     State where player was born
birthCity      City where player was born
deathYear      Year player died
deathMonth     Month player died
deathDay       Day player died
deathCountry   Country where player died
deathState     State where player died
deathCity      City where player died
nameFirst      Player's first name
nameLast       Player's last name
nameNote       Note about player's name (usually signifying that they changed
               their name or played under two differnt names)
nameGiven      Player's given name (typically first and middle)
nameNick       Player's nickname
weight         Player's weight in pounds
height         Player's height in inches
bats           Player's batting hand (left, right, or both)
throws         Player's throwing hand (left or right)
debut          Date that player made first major league appearance
finalGame      Date that player made first major league appearance (blank if still active)
college        College attended
lahman40ID     ID used in Lahman Database version 4.0
lahman45ID     ID used in Lahman database version 4.5
retroID        ID used by retrosheet
holtzID        ID used by Sean Holtz's Baseball Almanac
bbrefID        ID used by Baseball Reference website

Master 的資料型態
欄位	範例	資料型態
lahmanID	1	INT
playerID	aaronha01	STRING
managerID	NULL	INT
hofID	aaronha01h	STRING
birthYear	1934	INT
birthMonth	2	INT
birthDay	5	INT
birthCountry	USA	STRING
birthState	AL	STRING
birthCity	Mobile	STRING
deathYear	NULL	INT
deathMonth	NULL	INT
deathDay	NULL	INT
deathCountry	NULL	STRING
deathState	NULL	STRING
deathCity	NULL	STRING
nameFirst	Hank	STRING
nameLast	Aaron	STRING
nameNote	NULL	STRING
nameGiven	Henry Louis	STRING
nameNick	“Hammer, Hammerin’ Hank, Bad Henry”	STRING
weight	180	INT
height	72	INT
bats	R	STRING
throws	R	STRING
debut	4/13/1954	STRING
finalGame	10/3/1976	STRING
college	NULL	STRING
lahman40ID	aaronha01	STRING
lahman45ID	aaronha01	STRING
retroID	aaroh101	STRING
holtzID	aaronha01	STRING
bbrefID	aaronha01	STRING

HiveQL 建立資料表的語法是 create table <資料庫名稱>.<資料表名稱> ( 欄位1 資料型態1, ...., 欄位N 資料型態N )

hive> create table baseball_userXX.Master
( lahmanID INT, playerID STRING, managerID INT, hofID STRING, 
  birthYear INT, birthMonth INT, birthDay INT, birthCountry STRING, 
  birthState STRING, birthCity STRING, deathYear INT, deathMonth INT,
  deathDay INT, deathCountry STRING, deathState STRING, deathCity STRING,
  nameFirst STRING, nameLast STRING, nameNote STRING, nameGiven STRING,
  nameNick STRING, weight INT, height INT, bats STRING, throws STRING,
  debut STRING, finalGame STRING, college STRING, lahman40ID STRING,
  lahman45ID STRING, retroID STRING, holtzID STRING, bbrefID STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

<備註> 關於 Hive 支援的資料型態，請參考官方文件，並且注意目前您的 Hive 版本是否支援所需的資料型態。
```
Decimal datatype was introduced in Hive 0.11.0 (HIVE-2693) and revised in Hive 0.13.0 (HIVE-3976).
```

匯入 CSV 資料到 Hive 資料表

HiveQL 自 CSV 檔案匯入資料到 Hive 資料表的語法是 `LOAD DATA LOCAL INPATH <檔案路徑> [OVERWRITE] INTO TABLE <資料表名稱>。
```
hive> LOAD DATA LOCAL INPATH "Master.csv" OVERWRITE INTO TABLE baseball_userXX.Master;
```

檢查匯出結果

首先，我們用 HiveQL 的語法，來檢查是否有正常產生資料庫。語法為 SHOW DATABASES;。
```
hive> SHOW DATABASES;
OK
baseball
default
Time taken: 0.088 seconds
```
接著，讓我們切換預設的資料庫，變成剛剛產生的 baseball 資料庫。語法為 USE <資料庫名稱>;。
```
hive> USE baseball_userXX;
OK
Time taken: 0.014 seconds
```
我們可以用類似 SQL 語法，來查詢目前的資料庫有哪幾個資料表。語法為 SHOW TABLES;。
```
hive> SHOW TABLES;
OK
master
Time taken: 0.086 seconds
```
其次，我們可以檢查一下剛剛建立的 baseball.master 資料表，內容是否正常。由於 HiveQL 與 SQL-92 有很高的相容性，因此我們可以用標準的 SELECT * FROM <資料表名稱> WHERE <條件>; 來進行資料的檢索。
```
hive> SELECT * FROM Master;
hive> SELECT lahmanID FROM Master;
```

執行完以上的查詢後，您或許已經發現，當下的條件愈多，Hive 的執行速度就愈久。那是因為 Hive 必須將您下的查詢，轉換成 MapReduce 任務。

hive> SELECT lahmanID FROM Master WHERE birthyear > 1900;
hive> SELECT COUNT( * ) FROM Master;

MapReduce Total cumulative CPU time: 2 seconds 900 msec
Ended Job = job_201312211330_0022
No encryption was performed by peer.
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Accumulative CPU: 2.9 sec   HDFS Read: 0 HDFS Write: 0 SUCESS
Total MapReduce CPU Time Spent: 2 seconds 900 msec
OK
18126
Time taken: 97.89 seconds

若要離開 Hive Shell 的話，請下 quit;。
```
hive> quit;
```
有時，如果想直接執行很短的查詢，也可以用 hive -e <HiveQL 查詢語法> 的方式執行查詢。若加上 -S 參數，則會抑制 MapReduce 的標準錯誤輸出(STDERR)。
```
user@master ~/baseball $ hive -S -e "use baseball; SELECT birthyear, lahmanID, nameFirst FROM master WHERE birthyear > 1900 ;"
```

參考資料

“Build a data library with Hive”, By Peter J. Jamack, Big Data Analytics Consultant, 21 May 2013, IBM developerWorks

本文件最後更新於：