Hive 基础及实践

置顶大数据/ Hive 大数据 Hive

创建时间:2018-04-23 18:19

字数:4.7k

Hive 安装
DDL 数据定义
DML 数据操作
1. 数据导入
2. 数据导出
查询
函数

Hive是基于Hadoop的一个数据仓库工具，可以将结构化的数据文件映射为一张表,并提供类SQL查询功能,它可以将 HQL 转化成 MapReduce 程序在 yarn 上运行。

文中命令行默认为 hive> 命令行，其他命令行会特别说明，比如 shell>

Hive 安装

DDL 数据定义

创建数据库

数据库在HDFS上的默认存储路径是 /user/hive/warehouse/*.db

标准写法

1	create database if not exists db_hive;

指定数据库在 HDFS 中的存储位置

1	create database db_hive2 location '/db_hive2.db';

修改数据库

数据库的名称和数据库所在的目录位置是不可修改的，ALTER DATABAS 只能为数据库的 DBPROPERTIES 设置键值对属性值，来描述这个数据库的属性信息。

1	alter database db_hive set dbproperties('createtime'='20180630');

显示数据库

1 2	show databases; show databases like 'db_hive*';

查看数据库详情

1 2	desc database db_hive; desc database extended db_hive;

使用数据库

1	use db_hive;

删除数据库

删除空数据

1	drop database if exists db_hive2;

删除空数据库

1	drop database if exists db_hive2 cascade;

创建表语法

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name 
[(col_name data_type [COMMENT col_comment], ...)] 
[COMMENT table_comment] 
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 
[CLUSTERED BY (col_name, col_name, ...) 
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
[ROW FORMAT row_format] 
[STORED AS file_format] 
[LOCATION hdfs_path]

部分关键字解释：

EXTERNAL：

1
2
3

表示创建一个外部表，在建表的同时指定一个指向实际数据的路径（LOCATION）。
Hive创建内部表时，会将数据移动到数据仓库指向的路径；若创建外部表，仅记录数据所在的路径，不对数据的位置做任何改变。
在删除表的时候，内部表的元数据和数据会被一起删除，而外部表只删除元数据，不删除数据。

COMMENT：为表和列添加注释
PARTITIONED： BY创建分区表
CLUSTERED BY：创建分桶

ROW FORMAT：

DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] 
          [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] 
          | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
    
用户在建表的时候可以自定义 SERDE 或者使用自带的 SERDE。如果没有指定 ROW FORMAT 或者 ROW FORMAT DELIMITED，将会使用自带的 SERDE。
在建表的时候，用户还需要为表指定列，用户在指定表的列的同时也会指定自定义的SerDe，Hive通过SerDe确定表的具体的列的数据。

STORED AS：指定存储文件类型

1
2

常用的存储文件类型：SEQUENCEFILE（二进制序列文件）、TEXTFILE（文本）、RCFILE（列式存储格式文件）
如果文件数据是纯文本，可以使用STORED AS TEXTFILE。如果数据需要压缩，使用 STORED AS SEQUENCEFILE。

LOCATION：指定表在HDFS上的存储位置
LIKE允许用户复制现有的表结构，但是不复制数据。

管理表(内部表)

默认创建的表都是管理表(内部表)，Hive 会控制着数据的生命周期。Hive 默认情况下会将这些表的数据存储在由配置项hive.metastore.warehouse.dir(例如，/user/hive/warehouse)所定义的目录的子目录下。当删除一个管理表时，Hive也会删除这个表中数据，管理表不适合和其他工具共享数据。

普通创建表

create table if not exists student(
    id int, 
    name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/student';

根据查询结果创建表

1	create table if not exists student2 as select id, name from student;

根据已经存在的表结构创建表

1	create table if not exists student3 like student;

查询表的类型

1 2	desc formatted student; # 输出很多信息，找到 Table Type: MANAGED_TABLE，表示该表是一个管理表

外部表

Hive 表不拥有外部表的完整数据，删除外部表不会删除掉这份数据，不过描述表的元数据信息会被删除掉。建议将 HDFS 读入的数据存入外部表，在外部表的基础上做大量的统计分析，用到的中间表、结果表使用内部表存储。

创建数据库

1	create database if not exists nba;

创建球队表

create external table if not exists nba.team(
    id int,
    name string
)
row format delimited fields terminated by '\t';

球员表

create external table if not exists nba.player(
    id int,
    name string,
    position string,
    team_id int
)
row format delimited fields terminated by '\t';

查看创建的表

show tables;

向外部表导入数据

1 2	load data local inpath '/home/zj/datas/team.txt' into table nba.team; load data local inpath '/home/zj/datas/player.txt' into table nba.player;

查询结果

1 2	select * from nba.team; select * from nba.player;

查看表元数据

1 2	desc formatted team; # 可以看到 Table Type: EXTERNAL_TABLE

注意，删除外部表的时候并不能删除HDFS中存在的文件，如果想彻底删除，必须删除该表指向的文件

附件

team.txt
player.txt

分区表

分区表就是对应 HDFS 上的一个独立的文件夹，该文件夹下是该分区所有的数据文件。
Hive 中的分区就是分目录，把一个大的数据集根据业务需要分割成小的数据集。在查询时通过 WHERE 子句中的表达式选择查询所需要的指定的分区，这样的查询效率会提高很多。

分区表的创建

注意分区字段名不能和表的字段名重复

create table nba.team_partition(
    id int, 
    name string
)
partitioned by (division string) 
row format delimited fields terminated by '\t';

加载数据到分区表

1
2

load data local inpath '/home/zj/datas/team_east.txt' into table nba.team_partition partition(division='east');
load data local inpath '/home/zj/datas/team_west.txt' into table nba.team_partition partition(division='west');

查询分区表中的数据

1
2
3

select * from nba.team_partition where division='west';
select * from nba.team_partition where division='east';
select * from nba.team_partition where division='east' union select * from nba.team_partition where division='west';

增加分区

1	alter table nba.team_partition add partition(division='north') partition(division="south");

删除分区

1	alter table nba.team_partition drop partition(division='north'), partition(division='south');

查看分区

1	show partitions nba.team_partition;

查看分区表结构

1	desc formatted nba.team_partition;

创建二级分区表

create table nba.team_sub_partition(
    id int, 
    name string
)
partitioned by (division string, sub_division string) 
row format delimited fields terminated by '\t';

加载数据到二级分区表中

1	load data local inpath '/home/zj/datas/team_west_pacific.txt' into table nba.team_sub_partition partition(division='west', sub_division='pacific');

查询二级分区数据

1	select * from nba.team_sub_partition where division='west' and sub_division='pacific';

附件

team_east.txt
team_west.txt
team_west_pacific.txt

直接上传文件到分区目录，并使分区表和数据产生关联

上传数据后修复

# 创建目录
shell> hadoop fs -mkdir -p /user/hive/warehouse/nba.db/team_sub_partition/division=west/sub_division=northwest
# 上传源文件
shell> hadoop fs -put /home/zj/datas/team_west_northwest.txt /user/hive/warehouse/nba.db/team_sub_partition/division=west/sub_division=northwest
# 查询数据(查询不到数据)
select * from nba.team_sub_partition where division='west' and sub_division='northwest';
# 执行修复命令
msck repair table nba.team_sub_partition;
# 再次查询可以查到数据
select * from nba.team_sub_partition where division='west' and sub_division='northwest'

上传数据后添加分区

# 创建目录
shell> hadoop fs -mkdir -p /user/hive/warehouse/nba.db/team_sub_partition/division=west/sub_division=southwest
# 上传文件
shell> hadoop fs -put /home/zj/datas/team_west_southwest.txt /user/hive/warehouse/nba.db/team_sub_partition/division=west/sub_division=southwest
# 添加分区
alter table nba.team_sub_partition add partition(division='west',sub_division='southwest');
# 查询数据
select * from nba.team_sub_partition where division='west' and sub_division='southwest';

上传数据后 load 数据到分区

# 创建目录
shell> hadoop fs -mkdir -p /user/hive/warehouse/nba.db/team_sub_partition/division=east/sub_division=atlantic
# load 上传文件
load data local inpath '/home/zj/datas/team_east_atlantic.txt' into table nba.team_sub_partition partition(division='east',sub_division='atlantic');
# 查询数据
select * from nba.team_sub_partition where division='east' and sub_division='atlantic';

附件

team_east_atlantic.txt
team_west_southwest.txt
team_west_northwest.txt

表相关的操作

表重命名

1	alter table nba.team_sub_partition rename to nba.team_partition2;

更新列的语法

1	ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST\|AFTER column_name]

增加和替换列的语法

1
2

# ADD是代表新增一字段，字段位置在所有列后面(partition列前)，REPLACE则是表示替换表中所有字段。
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)

查询表结构

1	desc nba.team_partition;

添加列

1	alter table nba.team_partition add columns(city string);

更新列

1	alter table nba.team_partition change column city old_city string;

替换列

1	alter table nba.team_partition replace columns(team_id string, team_name);

删除表

1	drop table if exists time_partition2;

清除表中的数据

1 2	truncate table team; # 只能删除管理表(内部表)，不能删除外部表中的数据

DML 数据操作

数据导入

向表中装载数据（Load）

1	load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,…)];

load data: 表示加载数据
local: 表示从本地加载数据到 hive 表，否则从 HDFS 加载数据到 hive 表
inpath: 表示加载数据的路径
into table: 表示加载到哪张表
student: 表示具体的表
overwrite: 表示覆盖表中已有数据，否则表示追加
partition: 表示上传到指定分区

通过查询语句向表中插入数据（Insert）

drop table if exists team;
# 创建一张分区表
create table team(
    id int,
    name string
)
partitioned by (division string) 
row format delimited fields terminated by '\t';

# 插入数据
insert into table team partition(division='west') values ('1', 'Warriors');

# 根据单表查询结果插入数据
insert into table team partition(division='east') select id , name from team where division='west';

创建表时用 as select 加载数据

1 2	create table if not exists team2 as select id, name from team; # 注意，这种方式不能复制分区

创建表并通过 location 指定加载数据路径

# 创建表
drop table if exists team;
create table team(
    id int, 
    name string
)
row format delimited fields terminated by '\t' 
location '/datas/team';

# 上传数据到 HDFS
shell> hadoop fs -mkdir -p /datas/team
shell> hadoop fs -put /home/zj/datas/team.txt /datas/team
shell> hadoop fs -put /home/zj/datas/team_partition.txt /datas/team

# 查询 
select * from team;

import数据到指定hive表中(需要先 export 导出)…

# 导出
export table nba.team to '/datas/export/team';
# 清空
truncate table student;
# 导入
import table team partition(division='west') from '/datas/export/team';

数据导出

Inster 导出

# 将查询的结果导出到本地
insert overwrite local directory '/home/zj/datas/export'select * from team;

# 将查询的结果格式化导出到本地
insert overwrite local directory '/home/zj/datas/export'
row format delimited fields terminated by '\t' 
collection items terminated by '\n'
select * from team;
             
# 将查询的结果导出到HDFS上(没有local)
insert overwrite directory '/datas/export'
row format delimited fields terminated by '\t' 
collection items terminated by '\n'
select * from student;

注意: 导出的是块数据

Hadoop命令导出到本地

1	shell> hadoop fs -get /user/hive/warehouse/nba.db/team2/000000_0 /home/zj/datas/export/team.txt

Hive Shell 命令导出

1	shell> hive -e 'select * from nba.team;' > /home/zj/datas/export/nba.team.txt;

Export导出到HDFS上

1	export table nba.team to '/user/hive/warehouse/export/team';

Sqoop 导出

查询

查询语法

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
   [LIMIT number]

SQL 语言大小写不敏感。
SQL 可以写在一行或者多行
关键字不能被缩写也不能分行
各子句一般要分行写。
使用缩进提高语句的可读性。

基本查询

全表和特定字段查询

1
2
3

select * from team;
select name from team;
select name as team_name from team;

limit 语句

1	select * from team limit 2;

where 语句

1	select * from team where id <= 3;

比较运算符

select * from team where id=4;
select * from team where id between 2 and 4;
select * from team where name is null;
select * from team where id in (1, 3, 5);

逻辑运算符(AND/OR/NOT)

1
2
3

select * from team where id > 1 and id < 4;
select * from team where id > 4 or name='Lakers';
select * from team where id not in (1, 2, 3);

like 和 rlike

% 代表任意个字符， _ 代表一个字符

# 搜查找 name 以 'l' 开头的
select * from team where name like 'L%';
# 查找第二个字符为 'a' 的
select * from team where name like '_a%';
# 查找含有 'er' 的
# rlike 子句是Hive中这个功能的一个扩展，其可以通过J正则表达式指定匹配条件。
select * from team where name rlike 'er';

分组

group by 语句

group by 语句通常会和聚合函数一起使用，按照一个或者多个列队结果进行分组，然后对每个组执行聚合操作。

1	select division, avg(id) avg_id from team_partition group by division;

having 语句

where针对表中的列发挥作用，查询数据；having针对查询结果中的列发挥作用，筛选数据。
where后面不能写分组函数，而having后面可以使用分组函数
having只用于group by分组统计语句。
1
select division, avg(id) avg_id from team_partition group by division having avg_id > 4;
join 语句
内连接
只有进行连接的两个表中都存在与连接条件相匹配的数据才会被保留下来。

1	select player.name, team.name from player join team on player.team_id=team.id;

左外连接

JOIN操作符左边表中符合WHERE子句的所有记录将会被返回，不符合条件的位置用 NULL 替代

1	select p.name, t.name from player p left join team t on p.team_id=t.id;

右外连接

JOIN操作符右边表中符合WHERE子句的所有记录将会被返回，不符合条件的位置用 NULL 替代

1	select p.name, t.name from player p right join team t on p.team_id=t.id;

Full 连接

将会返回所有表中符合WHERE语句条件的所有记录。如果任一表的指定字段没有符合条件的值的话，那么就使用NULL值替代

1	select p.name, t.name from player p full join team t on p.team_id=t.id;

笛卡尔积join

省略连接条件或连接条件无效的情况下发生笛卡尔积

1	select player.name, team.name from player,team;

排序

全局排序(order by)

# 普通排序
select * from team order by id desc;
# 表达式别名排序
select name, id*2 as ids from team order by ids desc;
# 多列排序
select * from team order by name,id;

每个MapReduce内部排序(sort by)

每个MapReduce内部进行排序，对全局结果集来说不是排序。

# 设置 reduce 个数
set mapreduce.job.reduces=3;
# 查看 reduce 个数
set mapreduce.job.reduces;
# id 降序
select * from team sort by id desc;

分区排序(distribute by)

类似MR中partition，进行分区，结合sort by使用
注意: Hive要求DISTRIBUTE BY语句要写在SORT BY语句之前。
对于distribute by进行测试，一定要分配多reduce进行处理，否则无法看到distribute by的效果。

select * from player distribute by team_id sort by id desc;

player.id	player.name	player.position	player.team_id
6	James Harden	G	3
5	Chris Paul	G	3
2	LeBron James	FG	1
1	Kobe Bryant	SG	1
4	Paul George	F	2
3	Russell Westbrook	G	2

CLUSTER BY

当distribute by和sorts by字段相同时，可以使用cluster by方式。
cluster by除了具有distribute by的功能外还兼具sort by的功能。但是排序只能是倒序排序，不能指定排序规则为ASC或者DESC。

1
2
3

select * from player cluster by team_id;
等价于
select * from player distribute by team_id sort by team_id;

注意: 分区不一定是按照固定的数据，team_id为 1和2 的也可能分配到一个分区里去

分桶及抽样调查

分桶表

分区针对的是数据的存储路径；分桶针对的是数据文件

分区提供一个隔离数据和优化查询的便利方式。不过，并非所有的数据集都可形成合理的分区，特别是无法确定合适的划分大小的时候。

分桶是将数据集分解成更容易管理的若干部分的一种技术。

桶表不能通过load的方式直接加载数据，只能从另一张表中插入数据。

创建分桶表

create table team_bucket(
    id int,
    name string
)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t';

查看表结构

1 2	desc formatted team_bucket; # Num Buckets: 4

设置参数

1 2	set hive.enforce.bucketing=true; set mapreduce.job.reduces=-1;

从 team 表中通过子查询把数据导入到分桶表 team_bucket

1
2
3

insert into table team_bucket select id, name from team cluster by (id);
select * from team_bucket;
# 分桶成功

分桶抽样查询

分桶表非常适合解决抽样查询的需求

select * from team_bucket tablesample(bucket 1 out of 4 on id);
# tablesample(bucket x out of y)

# y 必须是 bucket 的倍数或者因子。
# hive 根据 y 的大小，决定抽样的比例。
# 假设 table 分成了 4 份，当 y=2 时，抽取 (4/2=2) 个 bucket 的数据，当 y=8时，抽取 (4/8=1/2) 个bucket的数据。

# x 表示从哪个 bucket 开始抽取。
# 假设 table 从 bucket 为 4，tablesample(bucket 4 out of 4) 表示抽取 (4/4=1) 个 bucket 的数据，抽取第 4 个 bucket 的数据

# 注意: x的值必须小于等于y的值

数据块抽样

Hive提供了另外一种按照百分比进行抽样的方式，这种是基于行数的，按照输入路径下的数据块百分比进行的抽样。

1	select * from team_bucket tablesample(0.1 percent);

注意: 这种抽样方式不一定适用于所有的文件格式。

函数

常用函数

查看系统自带的函数

1 2	show functions; # 显示函数的用法

显示自带的函数的用法

1 2	desc function count; desc function extended count;

求总行数(count)

1	select count(id) as num from team;

求最大值(max)

1	select max(id) as max_id from team;

求最小值(min)

1	select min(id) as min_id from team;

求总和(sun)

1	select sum(id) as sum_id from team;

求均值(avg)

1	select avg(id) as avg_id from team;

自定义函数

hive 自定义函数可以分为三种类别:

UDF(User-Defined-Function): 一进一出
UDAF(uuer-Defined Aggregation Function): 聚合函数，多进一出
UDTF(User-Defined Table-Generating Functions): 一进多出

因为 hive 是java程序，所以自定义函数最好是用java语言编写。

UDF(Python示例)

编写python

python 编写 hive UDF 是逐行处理标准输入流

"""
实现 upper 功能
"""
import sys

for line in sys.stdin:
    line = line.upper()
    print(line)

添加 python 文件到集群

1	add file /home/zj/hive/udf/upper.py

利用 UDF 查询

# 语法
SELECT TRANSFORM (<columns>) USING 'python <python_script>' AS (<columns>) FROM <table>;
# 示例
select transform(name) using 'python3 upper.py' as upper_name from team;
# 注意: 使用transform的时候不能查询别的列