實作九

Hive (1) : 將 CSV 資料匯出到 Hive

下載 CSV 示範資料:魔球

  1. 巨量資料的應用已經擴及生活中的許多角落,電影「魔球(MONEYBALL)」就是一個例子。在 Sean Lahman 的網站上公佈了一些關於棒球的 CSV 資料,是採創用 CC 3.0 授權(Creative Commons Attribution-ShareAlike 3.0 Unported License),因此,我們就先拿這份資料來當作學習如何將 CSV 格式資料匯入 Hive 的範例。
  2. 首先,我們下載 2012 年的球賽統計資料,並解壓縮。
    user@master ~ $ mkdir baseball
    user@master ~ $ cd baseball/
    user@master ~/baseball $ wget http://seanlahman.com/files/database/lahman2012-csv.zip
    user@master ~/baseball $ unzip lahman2012-csv.zip
    
  3. 接著,我們必須將資料上傳到 HDFS。讓我們在 HDFS 上建立一個名為 baseball 的目錄,並將所有 CSV 檔案上傳到該目錄中。
    user@master ~/baseball $ hadoop fs -mkdir baseball
    user@master ~/baseball $ hadoop fs -put *.csv baseball
    user@master ~/baseball $ hadoop fs -ls baseball
    Found 24 items
    -rw-------   3 user user     198529 2013-12-23 16:27 baseball/AllstarFull.csv
    -rw-------   3 user user    5730747 2013-12-23 16:28 baseball/Appearances.csv
    -rw-------   3 user user       7304 2013-12-23 16:28 baseball/AwardsManagers.csv
    -rw-------   3 user user     240867 2013-12-23 16:28 baseball/AwardsPlayers.csv
    -rw-------   3 user user      16719 2013-12-23 16:28 baseball/AwardsShareManagers.csv
    ... 略 ...
    

建立 Hive 資料庫

  1. 雖然 CSV 有欄位定義,但 Hive 並無法自動判斷欄位的資料型態,因此我們還是必須根據資料來源的定義,自行建立 Hive 資料表的 Schema。
  2. 如果想瞭解這份資料的每個 CSV 檔有哪些欄位,請參閱 readme 2012.txt 這個說明檔案。像是有 Open Source Sports 這樣的網站,而他們每年都會公佈各年度的統計資料。說明文件中也有提到另有 MS Access 的格式,若有需要,也可以下載這個版本,就可以比較輕易地進行 MS Access 先轉 MS SQL Server 再轉到 Hive 資料表的轉換,就可以省卻自己定義 Schema 的步驟。
    MS Access Versions:
          lahman2012.mdb
          2012readme.txt
    
  3. 在 Hive 上匯入 CSV 資料的順序是
  4. 首先,就讓我們來學習如何在 Hive 建立一個資料庫的語法。Hive 跟其他資料庫系統很像,也有一個互動式的 Shell 環境。請在命令列中執行指令 hive 就可以進入 Hive 的互動式查詢介面。
    user@master ~/baseball $ hive
    Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-0.8.1-cdh4.0.1.jar!/hive-log4j.properties
    Hive history file=/tmp/user/hive_job_log_user_201312231714_241463960.txt
    hive> 
    
  5. 後面的步驟若是看到 hive> 開頭的,就代表是在 Hive 的互動式介面中執行的指令。
  6. Hive 的語法,我們稱為 HiveQL,建立資料庫的 HiveQL 語法是 create database <資料庫名稱>
    hive> create database baseball_userXX;
    OK
    Time taken: 5.983 seconds
    

建立 Hive 資料表

  1. 接著,讓我們選定某一個想要分析的 CSV 資料,例如:Master.csv,先瞭解 Master 這個檔案的欄位跟資料型態。說明文件中提到
      MASTER - Player names, DOB, and biographical info
    ---
    2.1 MASTER table
    ---
    lahmanID       Unique number assigned to each player
    playerID       A unique code asssigned to each player.  The playerID links
                   the data in this file with records in the other files.
    managerID      An ID for individuals who served as managers
    hofID          An ID for individuals who are in teh baseball Hall of Fame
    birthYear      Year player was born
    birthMonth     Month player was born
    birthDay       Day player was born
    birthCountry   Country where player was born
    birthState     State where player was born
    birthCity      City where player was born
    deathYear      Year player died
    deathMonth     Month player died
    deathDay       Day player died
    deathCountry   Country where player died
    deathState     State where player died
    deathCity      City where player died
    nameFirst      Player's first name
    nameLast       Player's last name
    nameNote       Note about player's name (usually signifying that they changed
                   their name or played under two differnt names)
    nameGiven      Player's given name (typically first and middle)
    nameNick       Player's nickname
    weight         Player's weight in pounds
    height         Player's height in inches
    bats           Player's batting hand (left, right, or both)
    throws         Player's throwing hand (left or right)
    debut          Date that player made first major league appearance
    finalGame      Date that player made first major league appearance (blank if still active)
    college        College attended
    lahman40ID     ID used in Lahman Database version 4.0
    lahman45ID     ID used in Lahman database version 4.5
    retroID        ID used by retrosheet
    holtzID        ID used by Sean Holtz's Baseball Almanac
    bbrefID        ID used by Baseball Reference website
    
    Master 的資料型態
    欄位 範例 資料型態
    lahmanID 1 INT
    playerID aaronha01 STRING
    managerID NULL INT
    hofID aaronha01h STRING
    birthYear 1934 INT
    birthMonth 2 INT
    birthDay 5 INT
    birthCountry USA STRING
    birthState AL STRING
    birthCity Mobile STRING
    deathYear NULL INT
    deathMonth NULL INT
    deathDay NULL INT
    deathCountry NULL STRING
    deathState NULL STRING
    deathCity NULL STRING
    nameFirst Hank STRING
    nameLast Aaron STRING
    nameNote NULL STRING
    nameGiven Henry Louis STRING
    nameNick “Hammer, Hammerin’ Hank, Bad Henry” STRING
    weight 180 INT
    height 72 INT
    bats R STRING
    throws R STRING
    debut 4/13/1954 STRING
    finalGame 10/3/1976 STRING
    college NULL STRING
    lahman40ID aaronha01 STRING
    lahman45ID aaronha01 STRING
    retroID aaroh101 STRING
    holtzID aaronha01 STRING
    bbrefID aaronha01 STRING
  2. HiveQL 建立資料表的語法是 create table <資料庫名稱>.<資料表名稱> ( 欄位1 資料型態1, ...., 欄位N 資料型態N )
    hive> create table baseball_userXX.Master
    ( lahmanID INT, playerID STRING, managerID INT, hofID STRING, 
      birthYear INT, birthMonth INT, birthDay INT, birthCountry STRING, 
      birthState STRING, birthCity STRING, deathYear INT, deathMonth INT,
      deathDay INT, deathCountry STRING, deathState STRING, deathCity STRING,
      nameFirst STRING, nameLast STRING, nameNote STRING, nameGiven STRING,
      nameNick STRING, weight INT, height INT, bats STRING, throws STRING,
      debut STRING, finalGame STRING, college STRING, lahman40ID STRING,
      lahman45ID STRING, retroID STRING, holtzID STRING, bbrefID STRING )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
    

匯入 CSV 資料到 Hive 資料表

  1. HiveQL 自 CSV 檔案匯入資料到 Hive 資料表的語法是 `LOAD DATA LOCAL INPATH <檔案路徑> [OVERWRITE] INTO TABLE <資料表名稱>。
    hive> LOAD DATA LOCAL INPATH "Master.csv" OVERWRITE INTO TABLE baseball_userXX.Master;
    

檢查匯出結果

  1. 首先,我們用 HiveQL 的語法,來檢查是否有正常產生資料庫。語法為 SHOW DATABASES;
    hive> SHOW DATABASES;
    OK
    baseball
    default
    Time taken: 0.088 seconds
    
  2. 接著,讓我們切換預設的資料庫,變成剛剛產生的 baseball 資料庫。語法為 USE <資料庫名稱>;
    hive> USE baseball_userXX;
    OK
    Time taken: 0.014 seconds
    
  3. 我們可以用類似 SQL 語法,來查詢目前的資料庫有哪幾個資料表。語法為 SHOW TABLES;
    hive> SHOW TABLES;
    OK
    master
    Time taken: 0.086 seconds
    
  4. 其次,我們可以檢查一下剛剛建立的 baseball.master 資料表,內容是否正常。由於 HiveQL 與 SQL-92 有很高的相容性,因此我們可以用標準的 SELECT * FROM <資料表名稱> WHERE <條件>; 來進行資料的檢索。
    hive> SELECT * FROM Master;
    hive> SELECT lahmanID FROM Master;
    
  5. 執行完以上的查詢後,您或許已經發現,當下的條件愈多,Hive 的執行速度就愈久。那是因為 Hive 必須將您下的查詢,轉換成 MapReduce 任務。
    hive> SELECT lahmanID FROM Master WHERE birthyear > 1900;
    hive> SELECT COUNT( * ) FROM Master;
    
    MapReduce Total cumulative CPU time: 2 seconds 900 msec
    Ended Job = job_201312211330_0022
    No encryption was performed by peer.
    MapReduce Jobs Launched: 
    Job 0: Map: 1  Reduce: 1   Accumulative CPU: 2.9 sec   HDFS Read: 0 HDFS Write: 0 SUCESS
    Total MapReduce CPU Time Spent: 2 seconds 900 msec
    OK
    18126
    Time taken: 97.89 seconds
    
  6. 若要離開 Hive Shell 的話,請下 quit;
    hive> quit;
    
  7. 有時,如果想直接執行很短的查詢,也可以用 hive -e <HiveQL 查詢語法> 的方式執行查詢。若加上 -S 參數,則會抑制 MapReduce 的標準錯誤輸出(STDERR)。
    user@master ~/baseball $ hive -S -e "use baseball; SELECT birthyear, lahmanID, nameFirst FROM master WHERE birthyear > 1900 ;"
    

參考資料

  1. Build a data library with Hive”, By Peter J. Jamack, Big Data Analytics Consultant, 21 May 2013, IBM developerWorks

本文件最後更新於: