Introduction to Sqoop and Hive


Jazz Yao-Tsung Wang

Outline

PART 1 : When Should You Use Which Tool?

Three Kinds of Big Data Processing Tools

High Throughput Computing Technologies

Source : “High Throughput Computing Technologies”, by Jazz Yao-Tsung Wang, September 12, 2013

BIG DATA AT REST

MapReduce Framework

Source : “High Throughput Computing Technologies”, by Jazz Yao-Tsung Wang, September 12, 2013

BIG DATA IN MOTION

In-Memory Processing

Source : “High Throughput Computing Technologies”, by Jazz Yao-Tsung Wang, September 12, 2013

BIG DATA IN MOTION

Streaming Data Collection / Data Cleaning

Source : “High Throughput Computing Technologies”, by Jazz Yao-Tsung Wang, September 12, 2013

Apache Big Data Stack (1)


Source : “The Hadoop Stack - Then, Now and in the Future”, Hadoop World 2011

Apache Big Data Stack (2)


Source : “The Hadoop Stack - Then, Now and in the Future”, Hadoop World 2011

PART 2 : Introduction to Sqoop

Sqoop Syntax

user@master ~ $ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.
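
Before running an import, the list-databases and list-tables commands above are handy for verifying connectivity. A minimal sketch, assuming a MySQL server reachable as dbserver with a hypothetical world database and hduser account:

user@master ~ $ sqoop list-databases --connect jdbc:mysql://dbserver --username hduser -P
user@master ~ $ sqoop list-tables --connect jdbc:mysql://dbserver/world --username hduser -P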

Sqoop Import

Figure 1: Sqoop Import Overview

Source : “Apache Sqoop – Overview”, by Arvind Prabhakar, October 06, 2011


Sqoop Import Syntax (1)

user@master ~ $ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:

--connect [jdbc-uri]              Specify JDBC connect string
-P                                Read password from console
--username [username]             Set authentication username

Import control arguments:

--append                          Imports data in append mode
--as-avrodatafile                 Imports data to Avro data files
--as-sequencefile                 Imports data to SequenceFiles
--as-textfile                     Imports data as plain text (default)
--columns [col,col,col...]        Columns to import from table
-e,--query [statement]            Import results of SQL 'statement'
-m,--num-mappers [n]              Use 'n' map tasks to import in parallel
--table [table-name]              Table to read
--target-dir [dir]                HDFS plain table destination
--where [where clause]            WHERE clause to use during import
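
Putting the required and common arguments together, a minimal import sketch; the JDBC URI, database, table, and user (dbserver, world, City, hduser) are hypothetical placeholders:

user@master ~ $ sqoop import \
    --connect jdbc:mysql://dbserver/world \
    --username hduser -P \
    --table City \
    --target-dir /user/hduser/city \
    -m 4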

Sqoop Import Syntax (2)

Hive arguments:

--create-hive-table               Fail if the target hive table exists
--hive-import                     Import tables into Hive
--hive-table [table-name]         Sets the table name to use when importing to hive

HBase arguments:

--column-family [family]          Sets the target column family for the import
--hbase-create-table              If specified, create missing HBase tables
--hbase-row-key [col]             Specifies which input column to use as the row key
--hbase-table [table]             Import to [table] in HBase

At minimum, you must specify --connect and --table
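
As a hedged sketch combining the Hive arguments above with the import controls from the previous slide (same hypothetical dbserver/world/City names), this would pull the table straight into a new Hive table:

user@master ~ $ sqoop import \
    --connect jdbc:mysql://dbserver/world \
    --username hduser -P \
    --table City \
    --hive-import \
    --create-hive-table \
    --hive-table city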

Sqoop Export

Figure 2: Sqoop Export Overview

Source : “Apache Sqoop – Overview”, by Arvind Prabhakar, October 06, 2011

Sqoop Export Syntax

user@master ~ $ sqoop help export
usage: sqoop export [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:

   --connect [jdbc-uri]            Specify JDBC connect string
-P                                 Read password from console
   --username [username]           Set authentication username

Export control arguments:

   --columns [col,col,col...]      Columns to export to table
   --export-dir [dir]              HDFS source path for the export
-m,--num-mappers [n]               Use 'n' map tasks to export in parallel
   --table [table-name]            Table to populate

At minimum, you must specify --connect, --export-dir, and --table
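
A matching export sketch, again with hypothetical names; it assumes /user/hduser/city_out on HDFS already holds delimited records compatible with the target table:

user@master ~ $ sqoop export \
    --connect jdbc:mysql://dbserver/world \
    --username hduser -P \
    --table City_export \
    --export-dir /user/hduser/city_out \
    -m 4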

PART 3 : Introduction to Hive

Comparing Hive with a Traditional RDBMS

Feature                  Hive                               RDBMS
Schema                   Schema on read                     Schema on write
Updates                  INSERT supported                   UPDATE, INSERT, DELETE supported
Transactions             Not supported                      Supported
Indexes                  Not supported                      Supported
Latency                  Minutes                            Sub-second
Functions                Dozens of built-in functions       Hundreds of built-in functions
Multi-table inserts      Supported                          Not supported
CREATE TABLE AS SELECT   Supported                          Not in SQL-92
SELECT                   Single table in the FROM clause    SQL-92 standard
Joins                    INNER, OUTER, SEMI, MAP joins      SQL-92 or variants
Subqueries               Only in the FROM clause            In any clause

Table: Hive vs RDBMS. Source: Table 12-2, "A high-level comparison of SQL and HiveQL", Hadoop: The Definitive Guide, 3rd Edition
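
To make the two Hive-only rows above concrete (multi-table inserts and creating a table with SELECT), a short HiveQL sketch; the logs table and its level column are hypothetical. The first statement is a CREATE TABLE ... AS SELECT, and the final statement is a multi-table insert that scans logs once while feeding both target tables:

hive> CREATE TABLE error_logs AS SELECT * FROM logs WHERE level = 'ERROR';
hive> CREATE TABLE warn_logs LIKE error_logs;
hive> FROM logs
    > INSERT OVERWRITE TABLE error_logs SELECT * WHERE level = 'ERROR'
    > INSERT OVERWRITE TABLE warn_logs SELECT * WHERE level = 'WARN';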

“Schema on Write” vs “Schema on Read”