Hive实战 · 大数据笔记

## 创建外部表 ```sql create external table track_info( ip string, country string, province string, city string, url string, time string, page string ) partitioned by (day string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/project/trackinfo/'; ``` > location后面的路径指的是hdfs上面的路径。 * 创建数据输入文件夹 ``` hdfs dfs -mkdir -p /project/input/raw/ ``` * 数据上传到hdfs ``` hdfs dfs -put track.data /project/input/raw/ ``` * 现在要用之前的项目中ETLApp类进行数据清洗。 * 先把类中固定的地址设置为运行参数 ~~~ Path outputPath = new Path(args[1]); FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); ~~~ * 还有一个地方，由于使用了ip解析库，所以先要把ip库文件上传，然后更改代码里面的地址。 ![](https://img.kancloud.cn/ad/72/ad724575e5b4d361d12a2b7f91521fb2_733x171.png) * 然后打包，注意不要使用idea命令行，会提示找不到mvn。 ``` wangyijiadeMacBook-Air:~ bizzbee$ cd /Users/bizzbee/IdeaProjects/hadooptrainv2 wangyijiadeMacBook-Air:hadooptrainv2 bizzbee$ mvn clean package -DskipTests ``` * 然后上传服务器 ``` wangyijiadeMacBook-Air:target bizzbee$ scp hadoop-train-v2-1.0-SNAPSHOT.jar bizzbee@192.168.31.249:~/work/ ``` * 执行（参数分别是输入输出路径） ``` [bizzbee@bizzbee ~]$ hadoop jar /home/bizzbee/work/hadoop-train-v2-1.0-SNAPSHOT.jar com.bizzbee.bigdata.hadoop.mr.project.mr.ETLApp hdfs://bizzbee:8020/project/input/raw/track.data hdfs://bizzbee:8020/project/output/result ``` *可以查看到执行成功了 ``` http://192.168.31.249:50070/explorer.html#/project/output/result ``` * 然后把完成etl的数据加载到track_info表中。 ``` hive> LOAD DATA INPATH 'hdfs://bizzbee:8020/project/output/result' OVERWRITE INTO TABLE track_info partition(day='2019-11-12'); ``` * 在hive中使用sql统计每个省的流量。 ``` select province,count(*) as cnt from track_info where day='2019-11-12' group by province ; ``` ![](https://img.kancloud.cn/30/de/30de37e088213f317acb5711dbcdafc9_182x308.png) * 创建每个沈的统计数据表。 ```sql create table track_info_province_stat( province string, cnt bigint ) partitioned by (day string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; ``` * 把之前查询的结果作为数据导入省份表。 ```sql insert overwrite table track_info_province_stat partition(day='2019-11-12') select province,count(*) as cnt from track_info where day='2019-11-12' group by province ; ```