[TOC]

# 1. RC & ORC Introduction

**RC:** RC (Record Columnar) was open-sourced by Facebook.

1. Stores sets of rows, and within each set stores the row data in a columnar format;
2. Introduces a lightweight index that allows irrelevant row blocks to be skipped;
3. Splittable: sets of rows can be processed in parallel;
4. Compressible.

<br/>

**ORC:** ORC (Optimized Row Columnar) is an optimized version of RC.

![](https://img.kancloud.cn/12/d0/12d0cbdd03ded7497b679d6602d38eaf_628x244.png)

If the original file is 585GB, storing it as RC shrinks it to 505GB, while ORC shrinks it to 131GB; no data is lost in either case.

<br/>

**RCFile storage layout:**

1. Combines the advantages of row storage and column storage;
2. The design is similar to Parquet: the data is first split horizontally into row groups, and the data within each row group is then stored by column;

:-: ![](https://img.kancloud.cn/df/2c/df2cdd26932b3187e5c2395d9de576b2_550x336.png)
RC design

![](https://img.kancloud.cn/d1/05/d1059322774e4d7fbcda7ce022cebaea_601x649.png)
RCFile storage format

**Stripe:**
>1. Each ORC file is first split horizontally into multiple stripes;
>2. The default stripe size is 250MB;
>3. Each stripe consists of multiple row groups;

**IndexData:**
>1. Records the positions and the total row count of the data in the stripe;

**RowData:**
>1. Stores the data in the form of streams;

**Stripe Footer:**
>1. Contains statistics for the stripe: max, min, count, etc.;

**FileFooter:**
>1. Statistics for the whole table;
>2. The location of each stripe;

**Postscript:**
>1. The table's row count, compression parameters, compressed size, column information, etc.;

(A sketch at the end of the next section shows how this file-level metadata can be inspected through the Java `Reader` API.)

<br/>

# 2. Reading and Writing ORC Files in Java

Add the following dependencies to *`pom.xml`*:

```xml
<dependency>
    <groupId>org.apache.orc</groupId>
    <artifactId>orc-core</artifactId>
    <version>1.5.1</version>
</dependency>
<dependency>
    <groupId>org.apache.orc</groupId>
    <artifactId>orc-mapreduce</artifactId>
    <version>1.5.1</version>
</dependency>
<dependency>
    <groupId>org.apache.orc</groupId>
    <artifactId>orc-tools</artifactId>
    <version>1.5.1</version>
</dependency>
```

Java example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.*;

import java.io.IOException;

public class ORCFileOps {

    private static Configuration conf = new Configuration();
    private static String ORCPATH = "/tmp/orcfile.orc";

    public static void main(String[] args) throws IOException {
        write();
        read();
    }

    public static void write() throws IOException {
        // Define the schema
        TypeDescription schema = TypeDescription.fromString("struct<x:int,y:int>");
        // Create the writer
        Writer writer = OrcFile.createWriter(new Path(ORCPATH),
                OrcFile.writerOptions(conf).setSchema(schema));

        // Write the file
        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector x = (LongColumnVector) batch.cols[0];
        LongColumnVector y = (LongColumnVector) batch.cols[1];

        // Generate 10000 rows of sample data
        for (int r = 0; r < 10000; ++r) {
            int row = batch.size++;
            x.vector[row] = r;
            y.vector[row] = r * 3;
            // A batch holds 1024 rows by default; once full, flush it and start a new one.
            if (batch.size == batch.getMaxSize()) {
                writer.addRowBatch(batch);
                batch.reset();
            }
        }
        if (batch.size != 0) {
            writer.addRowBatch(batch);
            batch.reset();
        }
        writer.close();
    }

    public static void read() throws IOException {
        // Create a Reader with OrcFile
        Reader reader = OrcFile.createReader(new Path(ORCPATH),
                OrcFile.readerOptions(conf));
        // Read the file
        RecordReader rows = reader.rows();
        // Create a row batch from the file's schema
        VectorizedRowBatch batch = reader.getSchema().createRowBatch();

        // Print the contents
        while (rows.nextBatch(batch)) {
            System.out.println("================ batch separator ======================");
            System.out.println("Rows in this batch: " + batch.size);
            // Convert the ORC column vectors to Java primitive arrays
            ColumnVector[] cols = batch.cols;
            LongColumnVector vx = (LongColumnVector) cols[0];
            LongColumnVector vy = (LongColumnVector) cols[1];
            long[] x = vx.vector;
            long[] y = vy.vector;

            // Print x and y
            for (int i = 0; i < batch.size; i++) {
                System.out.println(x[i] + ":" + y[i]);
            }
        }
        rows.close();
    }
}
```
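The compression gains shown in section 1 depend on the codec and layout the writer is configured with. As a hedged extension of the `write()` example above (same schema; the output path `/tmp/orcfile_snappy.orc` is hypothetical), the codec, stripe size, and row-index stride can be set through `OrcFile.writerOptions`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

import java.io.IOException;

public class OrcWriterOptionsSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        TypeDescription schema = TypeDescription.fromString("struct<x:int,y:int>");
        Writer writer = OrcFile.createWriter(
                new Path("/tmp/orcfile_snappy.orc"),      // hypothetical output path
                OrcFile.writerOptions(conf)
                        .setSchema(schema)
                        .compress(CompressionKind.SNAPPY) // codec; ZLIB is the default
                        .stripeSize(64L * 1024 * 1024)    // target stripe size in bytes
                        .rowIndexStride(10000));          // rows covered by each index entry
        // Rows would be added here exactly as in write() above.
        writer.close();
    }
}
```

SNAPPY trades some compression ratio for faster decompression; the default ZLIB compresses harder but is slower to read back.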
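The stripe and footer metadata described in section 1 can be inspected without scanning any row data. A minimal sketch, assuming the `/tmp/orcfile.orc` file produced by `write()` above; all calls are on the same `org.apache.orc.Reader` used in `read()`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.StripeInformation;

import java.io.IOException;

public class OrcMetaDump {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Reader reader = OrcFile.createReader(new Path("/tmp/orcfile.orc"),
                OrcFile.readerOptions(conf));
        // File-level metadata, read from the FileFooter/Postscript
        System.out.println("schema      : " + reader.getSchema());
        System.out.println("rows        : " + reader.getNumberOfRows());
        System.out.println("compression : " + reader.getCompressionKind());
        // Per-stripe metadata: where each stripe sits and how many rows it holds
        for (StripeInformation stripe : reader.getStripes()) {
            System.out.println("stripe offset=" + stripe.getOffset()
                    + " dataLength=" + stripe.getDataLength()
                    + " rows=" + stripe.getNumberOfRows());
        }
    }
}
```

The `orc-tools` dependency declared above ships a command-line equivalent, e.g. `java -jar orc-tools-1.5.1-uber.jar meta /tmp/orcfile.orc`.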
# 3. Using ORC in Hive

```sql
create external table user_orc_ext(
    name string,
    age int
)
stored as orc;

0: jdbc:hive2://hadoop101:10000> select * from user_orc_ext;
+--------------------+-------------------+--+
| user_orc_ext.name  | user_orc_ext.age  |
+--------------------+-------------------+--+
+--------------------+-------------------+--+
0: jdbc:hive2://hadoop101:10000> show create table user_orc_ext;
+----------------------------------------------------+--+
|                   createtab_stmt                   |
+----------------------------------------------------+--+
| CREATE EXTERNAL TABLE `user_orc_ext`(              |
| `name` string,                                     |
| `age` int)                                         |
| ROW FORMAT SERDE                                   |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'        |
| STORED AS INPUTFORMAT                              |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'  |
| OUTPUTFORMAT                                       |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION                                           |
| 'hdfs://hadoop101:9000/home/hadoop/hive/warehouse/hivebook.db/user_orc_ext' |
| TBLPROPERTIES (                                    |
| 'transient_lastDdlTime'='1609156760')              |
+----------------------------------------------------+--+
```
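The table above is still empty; an ORC table is normally populated by rewriting rows from a staging table, since plain text files are not valid ORC. A hedged sketch: `user_txt` and `user_orc_snappy` are hypothetical table names, and `'orc.compress'` is the standard ORC table property for choosing the codec:

```sql
-- Hypothetical text-format staging table holding the raw rows
create table user_txt(name string, age int)
row format delimited fields terminated by ',';

-- ORC table with an explicit codec (ZLIB is the default)
create table user_orc_snappy(
    name string,
    age int
)
stored as orc
tblproperties ('orc.compress'='SNAPPY');

-- Rewrite the text rows as ORC files
insert into table user_orc_snappy
select name, age from user_txt;
```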