user@master ~ $ mkdir baseball user@master ~ $ cd baseball/ user@master ~/baseball $ wget http://seanlahman.com/files/database/lahman2012-csv.zip user@master ~/baseball $ unzip lahman2012-csv.zip
user@master ~/baseball $ hadoop fs -mkdir baseball user@master ~/baseball $ hadoop fs -put *.csv baseball user@master ~/baseball $ hadoop fs -ls baseball Found 24 items -rw------- 3 user user 198529 2013-12-23 16:27 baseball/AllstarFull.csv -rw------- 3 user user 5730747 2013-12-23 16:28 baseball/Appearances.csv -rw------- 3 user user 7304 2013-12-23 16:28 baseball/AwardsManagers.csv -rw------- 3 user user 240867 2013-12-23 16:28 baseball/AwardsPlayers.csv -rw------- 3 user user 16719 2013-12-23 16:28 baseball/AwardsShareManagers.csv ... 略 ...
MS Access Versions: lahman2012.mdb 2012readme.txt
hive
就可以進入 Hive 的互動式查詢介面。
user@master ~/baseball $ hive Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-0.8.1-cdh4.0.1.jar!/hive-log4j.properties Hive history file=/tmp/user/hive_job_log_user_201312231714_241463960.txt hive>
hive>
開頭的,就代表是在 Hive 的互動式介面中執行的指令。create database <資料庫名稱>
。
hive> create database baseball_userXX; OK Time taken: 5.983 seconds
MASTER - Player names, DOB, and biographical info --- 2.1 MASTER table --- lahmanID Unique number assigned to each player playerID A unique code asssigned to each player. The playerID links the data in this file with records in the other files. managerID An ID for individuals who served as managers hofID An ID for individuals who are in teh baseball Hall of Fame birthYear Year player was born birthMonth Month player was born birthDay Day player was born birthCountry Country where player was born birthState State where player was born birthCity City where player was born deathYear Year player died deathMonth Month player died deathDay Day player died deathCountry Country where player died deathState State where player died deathCity City where player died nameFirst Player's first name nameLast Player's last name nameNote Note about player's name (usually signifying that they changed their name or played under two differnt names) nameGiven Player's given name (typically first and middle) nameNick Player's nickname weight Player's weight in pounds height Player's height in inches bats Player's batting hand (left, right, or both) throws Player's throwing hand (left or right) debut Date that player made first major league appearance finalGame Date that player made first major league appearance (blank if still active) college College attended lahman40ID ID used in Lahman Database version 4.0 lahman45ID ID used in Lahman database version 4.5 retroID ID used by retrosheet holtzID ID used by Sean Holtz's Baseball Almanac bbrefID ID used by Baseball Reference website
欄位 | 範例 | 資料型態 |
---|---|---|
lahmanID | 1 | INT |
playerID | aaronha01 | STRING |
managerID | NULL | INT |
hofID | aaronha01h | STRING |
birthYear | 1934 | INT |
birthMonth | 2 | INT |
birthDay | 5 | INT |
birthCountry | USA | STRING |
birthState | AL | STRING |
birthCity | Mobile | STRING |
deathYear | NULL | INT |
deathMonth | NULL | INT |
deathDay | NULL | INT |
deathCountry | NULL | STRING |
deathState | NULL | STRING |
deathCity | NULL | STRING |
nameFirst | Hank | STRING |
nameLast | Aaron | STRING |
nameNote | NULL | STRING |
nameGiven | Henry Louis | STRING |
nameNick | “Hammer, Hammerin’ Hank, Bad Henry” | STRING |
weight | 180 | INT |
height | 72 | INT |
bats | R | STRING |
throws | R | STRING |
debut | 4/13/1954 | STRING |
finalGame | 10/3/1976 | STRING |
college | NULL | STRING |
lahman40ID | aaronha01 | STRING |
lahman45ID | aaronha01 | STRING |
retroID | aaroh101 | STRING |
holtzID | aaronha01 | STRING |
bbrefID | aaronha01 | STRING |
create table <資料庫名稱>.<資料表名稱> ( 欄位1 資料型態1, ...., 欄位N 資料型態N )
hive> create table baseball_userXX.Master ( lahmanID INT, playerID STRING, managerID INT, hofID STRING, birthYear INT, birthMonth INT, birthDay INT, birthCountry STRING, birthState STRING, birthCity STRING, deathYear INT, deathMonth INT, deathDay INT, deathCountry STRING, deathState STRING, deathCity STRING, nameFirst STRING, nameLast STRING, nameNote STRING, nameGiven STRING, nameNick STRING, weight INT, height INT, bats STRING, throws STRING, debut STRING, finalGame STRING, college STRING, lahman40ID STRING, lahman45ID STRING, retroID STRING, holtzID STRING, bbrefID STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
Decimal datatype was introduced in Hive 0.11.0 (HIVE-2693) and revised in Hive 0.13.0 (HIVE-3976).
hive> LOAD DATA LOCAL INPATH "Master.csv" OVERWRITE INTO TABLE baseball_userXX.Master;
SHOW DATABASES;
。
hive> SHOW DATABASES; OK baseball default Time taken: 0.088 seconds
USE <資料庫名稱>;
。
hive> USE baseball_userXX; OK Time taken: 0.014 seconds
SHOW TABLES;
。
hive> SHOW TABLES; OK master Time taken: 0.086 seconds
SELECT * FROM <資料表名稱> WHERE <條件>;
來進行資料的檢索。
hive> SELECT * FROM Master; hive> SELECT lahmanID FROM Master;
hive> SELECT lahmanID FROM Master WHERE birthyear > 1900; hive> SELECT COUNT( * ) FROM Master;
MapReduce Total cumulative CPU time: 2 seconds 900 msec Ended Job = job_201312211330_0022 No encryption was performed by peer. MapReduce Jobs Launched: Job 0: Map: 1 Reduce: 1 Accumulative CPU: 2.9 sec HDFS Read: 0 HDFS Write: 0 SUCESS Total MapReduce CPU Time Spent: 2 seconds 900 msec OK 18126 Time taken: 97.89 seconds
quit;
。
hive> quit;
hive -e <HiveQL 查詢語法>
的方式執行查詢。若加上 -S
參數,則會抑制 MapReduce 的標準錯誤輸出(STDERR)。
user@master ~/baseball $ hive -S -e "use baseball; SELECT birthyear, lahmanID, nameFirst FROM master WHERE birthyear > 1900 ;"