hive> dfs -ls;
The set command can also be used for variable substitution, for example:
hive> set hivevar:i='121.61.99.14.128160791368.5';
hive> select count(*) from c02_clickstat_fatdt1 where cookie_id=${hivevar:i};
11
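Variables can also be set in the hiveconf namespace and referenced the same way (a minimal sketch; it reuses the table and value from the example above):

hive> set hiveconf:j='121.61.99.14.128160791368.5';
hive> select count(*) from c02_clickstat_fatdt1 where cookie_id=${hiveconf:j};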
2.7.3 Hive Resources
Hive can manage the addition of resources to a session where those resources need to be made available at query execution time. Any locally accessible file can be added to the session. Once a file is added to a session, a Hive query can refer to it by its name (in map/reduce/transform clauses), and the file is available locally at execution time on the entire Hadoop cluster. Hive uses Hadoop's Distributed Cache to distribute the added files to all the machines in the cluster at query execution time.
Usage:
ADD { FILE[S] | JAR[S] | ARCHIVE[S] } <filepath1> [<filepath2>]*
Example:
hive> add FILE /tmp/tt.py;
hive> list FILES;
/tmp/tt.py
hive> from networks a MAP a.networkid USING 'python tt.py' as nn where a.ds = '2009-01-04' limit 10;
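The other resource types work the same way; a minimal sketch (the jar and archive paths are hypothetical):

hive> add JAR /tmp/my_udfs.jar;
hive> add ARCHIVE /tmp/scripts.tar.gz;
hive> list JARS;
hive> delete FILE /tmp/tt.py;   -- a resource can also be removed from the session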
It is not necessary to add files to the session if the files used in a transform script are already available on all machines in the Hadoop cluster under the same path name. For example:
... MAP a.networkid USING 'wc -l' ...: here wc is an executable available on all machines
... MAP a.networkid USING '/home/nfsserv1/hadoopscripts/tt.py' ...: here tt.py may be accessible via an NFS mount point that is configured identically on all the cluster nodes
2.7.4 Calling Python, shell, and other languages
For example, the following SQL uses weekday_mapper.py to process the data:

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

The contents of weekday_mapper.py are as follows:
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])
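After the INSERT OVERWRITE above, the result can be spot-checked with a query such as the following (a minimal sketch; the actual counts depend on the u_data sample data):

SELECT weekday, COUNT(*) AS cnt
FROM u_data_new
GROUP BY weekday;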
The following example uses the shell cat command to process the data:

FROM invites a
INSERT OVERWRITE TABLE events
SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab)
USING '/bin/cat'
WHERE a.ds > '2008-08-09';
2.8 DROP
Dropping an internal (managed) table deletes both the table's metadata and its data. Dropping an external table deletes only the metadata and leaves the data in place.
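A minimal sketch illustrating the difference (the table names and the external LOCATION are hypothetical):

CREATE TABLE managed_logs (line STRING);                  -- internal (managed) table
CREATE EXTERNAL TABLE ext_logs (line STRING)
LOCATION '/user/hive/external/ext_logs';                  -- external table

DROP TABLE managed_logs;   -- removes the metadata and deletes the data files
DROP TABLE ext_logs;       -- removes only the metadata; the files under LOCATION remain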
2.9 Miscellaneous
2.9.1 Limit
LIMIT restricts the number of records returned by a query. The returned records are selected at random. The following query randomly selects 5 records from table t1:
SELECT * FROM t1 LIMIT 5
2.9.2 Top k
The following query returns the 5 sales representatives with the largest sales amounts.
SET mapred.reduce.tasks = 1
SELECT * FROM sales SORT BY amount DESC LIMIT 5
2.9.3 REGEX Column Specification
A SELECT statement can use regular expressions for column selection. The following statement queries all columns except ds and hr:
SELECT `(ds|hr)?+.+` FROM sales
3. Hive Select
Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]

3.1 Group By
Basic syntax:
groupByClause: GROUP BY groupByExpression (, groupByExpression)*
groupByExpression: expression
groupByQuery: SELECT expression (, expression)* FROM src groupByClause?
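A minimal example of the grammar above (a sketch that assumes the pv_users table used later in this section):

SELECT pv_users.gender, count(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;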
Advanced features:
• The output of the aggregations can be further sent to multiple tables, or even to files on Hadoop's DFS (which can then be manipulated with HDFS utilities). For example, in addition to the breakdown by gender, we may need to find the distinct page views broken down by age, all in a single query, as in the following example:

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
  SELECT pv_users.age, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;
• hive.map.aggr controls how the aggregation is done. The default is true, in which case Hive performs the first-level aggregation directly in the map task. This usually gives better efficiency, but may require more memory to run successfully.
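Setting it for the current session is a one-liner (a minimal sketch; the query simply reuses the pv_users example above):

hive> set hive.map.aggr=true;
hive> select pv_users.gender, count(DISTINCT pv_users.userid) from pv_users group by pv_users.gender;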