hive> dfs -ls;
The set command can also be used for variable substitution, for example:
hive> set hivevar:i='121.61.99.14.128160791368.5';
hive> select count(*) from c02_clickstat_fatdt1 where cookie_id=${hivevar:i};
11
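Variables can also be set in the hiveconf namespace and referenced the same way (a minimal sketch; it reuses the table and value from the example above):

hive> set hiveconf:j='121.61.99.14.128160791368.5';
hive> select count(*) from c02_clickstat_fatdt1 where cookie_id=${hiveconf:j};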
2.7.3 Hive Resources
Hive can manage the addition of resources to a session where those resources need to be made available at query execution time. Any locally accessible file can be added to the session. Once a file is added to a session, a Hive query can refer to it by its name (in map/reduce/transform clauses), and the file is available locally at execution time on the entire Hadoop cluster. Hive uses Hadoop's Distributed Cache to distribute the added files to all the machines in the cluster at query execution time.
Usage:
ADD { FILE[S] | JAR[S] | ARCHIVE[S] } <filepath1> [<filepath2>]*
Example:
hive> add FILE /tmp/tt.py;
hive> list FILES;
/tmp/tt.py
hive> from networks a MAP a.networkid USING 'python tt.py' as nn where a.ds = '2009-01-04' limit 10;
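The other resource types work the same way; a minimal sketch (the jar and archive paths are hypothetical):

hive> add JAR /tmp/my_udfs.jar;
hive> add ARCHIVE /tmp/scripts.tar.gz;
hive> list JARS;
hive> delete FILE /tmp/tt.py;   -- a resource can also be removed from the session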
It is not necessary to add files to the session if the files used in a transform script are already available on all machines in the Hadoop cluster under the same path name. For example:
... MAP a.networkid USING 'wc -l' ...: here wc is an executable available on all machines
... MAP a.networkid USING '/home/nfsserv1/hadoopscripts/tt.py' ...: here tt.py may be accessible via an NFS mount point that is configured identically on all the cluster nodes
2.7.4 Calling Python, shell, and other languages
For example, the following SQL uses weekday_mapper.py to process the data:

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

The contents of weekday_mapper.py are as follows:
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])
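After the INSERT OVERWRITE above, the result can be spot-checked with a query such as the following (a minimal sketch; the actual counts depend on the u_data sample data):

SELECT weekday, COUNT(*) AS cnt
FROM u_data_new
GROUP BY weekday;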
The following example uses the shell cat command to process the data:

FROM invites a
INSERT OVERWRITE TABLE events
SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab)
USING '/bin/cat'
WHERE a.ds > '2008-08-09';
2.8 DROP
Dropping an internal (managed) table deletes both the table's metadata and its data. Dropping an external table deletes only the metadata and leaves the data in place.
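A minimal sketch illustrating the difference (the table names and the external LOCATION are hypothetical):

CREATE TABLE managed_logs (line STRING);                  -- internal (managed) table
CREATE EXTERNAL TABLE ext_logs (line STRING)
LOCATION '/user/hive/external/ext_logs';                  -- external table

DROP TABLE managed_logs;   -- removes the metadata and deletes the data files
DROP TABLE ext_logs;       -- removes only the metadata; the files under LOCATION remain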
2.9 Miscellaneous
2.9.1 Limit
LIMIT restricts the number of records returned by a query. The returned records are selected at random. The following query randomly selects 5 records from table t1:
SELECT * FROM t1 LIMIT 5
2.9.2 Top k
The following query returns the 5 sales representatives with the largest sales amounts.
SET mapred.reduce.tasks = 1
SELECT * FROM sales SORT BY amount DESC LIMIT 5
2.9.3 REGEX Column Specification
A SELECT statement can use regular expressions for column selection. The following statement queries all columns except ds and hr:
SELECT `(ds|hr)?+.+` FROM sales
3. Hive Select
Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]

3.1 Group By
Basic syntax:
groupByClause: GROUP BY groupByExpression (, groupByExpression)*
groupByExpression: expression
groupByQuery: SELECT expression (, expression)* FROM src groupByClause?
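A minimal example of the grammar above (a sketch that assumes the pv_users table used later in this section):

SELECT pv_users.gender, count(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;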
Advanced features:
• The output of the aggregations can be further sent to multiple tables, or even to files on Hadoop's DFS (which can then be manipulated with HDFS utilities). For example, in addition to the breakdown by gender, we may need to find the distinct page views broken down by age, all in a single query, as in the following example:

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
  SELECT pv_users.age, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;
• hive.map.aggr controls how the aggregation is done. The default is true, in which case Hive performs the first-level aggregation directly in the map task. This usually gives better efficiency, but may require more memory to run successfully.
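Setting it for the current session is a one-liner (a minimal sketch; the query simply reuses the pv_users example above):

hive> set hive.map.aggr=true;
hive> select pv_users.gender, count(DISTINCT pv_users.userid) from pv_users group by pv_users.gender;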