[TOC]

# 1. Avro Features and Storage Format

Apache Avro is a data serialization system created by Doug Cutting, the father of Hadoop. An Avro file stores the data definition (the schema) as JSON and the data itself in a compact binary format.

Official site: http://avro.apache.org/docs/current/
<br/>

Features:
➢ Rich data structures
➢ A fast, compressible binary data format
➢ A container file for persisting data
➢ Built-in remote procedure call (RPC)
➢ Simple integration with dynamic languages
<br/>

:-: ![](https://img.kancloud.cn/c9/70/c97011a2287a821c780862ac9b16b2fe_1075x628.png)
Avro storage format
<br/>

Primitive types: null, boolean, int, long, float, double, bytes, string
Complex types: record, enum, array, map, union, fixed
<br/>

You can produce the Avro format either by writing your own code or by using the avro-tools utility (a single jar).
<br/>

# 2. Producing the Avro Format with avro-tools

(1) Define the storage format (schema) of the User object in the user.avsc file:

```json
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": "int"},
     {"name": "favorite_color", "type": "string"}
 ]
}
```

(2) Store the data in the user.json file:

```json
{"name": "Alyssa", "favorite_number": 256, "favorite_color": "black"}
{"name": "Ben", "favorite_number": 7, "favorite_color": "red"}
{"name": "Charlie", "favorite_number": 12, "favorite_color": "blue"}
```

(3) Run avro-tools to combine schema + data into the user.avro file. The avro-tools jar can be downloaded from https://mvnrepository.com/artifact/org.apache.avro/avro-tools.

```shell
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar fromjson --schema-file \
/hdatas/user.avsc /hdatas/user.json > /hdatas/user.avro
```

Or with a compression codec:

```shell
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar fromjson --codec snappy --schema-file \
/hdatas/user.avsc /hdatas/user.json > /hdatas/user.avro
```

(4) We can also turn user.avro back into a JSON file:

```shell
# View the data converted to JSON
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar tojson /hdatas/user.avro
{"name":"Alyssa","favorite_number":256,"favorite_color":"black"}
{"name":"Ben","favorite_number":7,"favorite_color":"red"}
{"name":"Charlie","favorite_number":12,"favorite_color":"blue"}

# Save the output to the user_002.json file
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar tojson \
/hdatas/user.avro > /hdatas/user_002.json
```
Or output pretty-printed JSON:

```shell
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar tojson --pretty /hdatas/user.avro
{
  "name" : "Alyssa",
  "favorite_number" : 256,
  "favorite_color" : "black"
}
{
  "name" : "Ben",
  "favorite_number" : 7,
  "favorite_color" : "red"
}
{
  "name" : "Charlie",
  "favorite_number" : 12,
  "favorite_color" : "blue"
}
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar tojson --pretty \
/hdatas/user.avro > /hdatas/user_002.json
```

(5) We can also read the metadata of user.avro:

```shell
# View the metadata of user.avro
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar getmeta /hdatas/user.avro
avro.schema	{"type":"record","name":"User","namespace":"example.avro",
"fields":[{"name":"name","type":"string"},
{"name":"favorite_number","type":"int"},
{"name":"favorite_color","type":"string"}]}
avro.codec	snappy
```

(6) Extract the schema of user.avro:

```shell
# View the schema
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar getschema /hdatas/user.avro
{
  "type" : "record",
  "name" : "User",
  "namespace" : "example.avro",
  "fields" : [ {
    "name" : "name",
    "type" : "string"
  }, {
    "name" : "favorite_number",
    "type" : "int"
  }, {
    "name" : "favorite_color",
    "type" : "string"
  } ]
}

# Save the output to the user_002.avsc file
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar getschema /hdatas/user.avro > /hdatas/user_002.avsc
```

<br/>

**Listing the available commands**

```shell
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar
Version 1.8.2 of Apache Avro
Copyright 2010-2015 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
----------------
Available tools:
  cat            extracts samples from files
  compile        Generates Java code for the given schema.
  concat         Concatenates avro files without re-compressing.
  fragtojson     Renders a binary-encoded Avro datum as JSON.
  fromjson       Reads JSON records and writes an Avro data file.
  fromtext       Imports a text file into an avro data file.
  getmeta        Prints out the metadata of an Avro data file.
  getschema      Prints out schema of an Avro data file.
  idl            Generates a JSON schema from an Avro IDL file
  idl2schemata   Extract JSON schemata of the types from an Avro IDL file
  induce         Induce schema/protocol from Java class/interface via reflection.
  jsontofrag     Renders a JSON-encoded Avro datum as binary.
  random         Creates a file with randomly generated instances of a schema.
  recodec        Alters the codec of a data file.
  repair         Recovers data from a corrupt Avro Data file
  rpcprotocol    Output the protocol of a RPC service
  rpcreceive     Opens an RPC Server and listens for one message.
  rpcsend        Sends a single RPC message.
  tether         Run a tethered mapreduce job.
  tojson         Dumps an Avro data file as JSON, record per line or pretty.
  totext         Converts an Avro data file to a text file.
  totrevni       Converts an Avro data file to a Trevni file.
  trevni_meta    Dumps a Trevni file's metadata as JSON.
  trevni_random  Create a Trevni file filled with random instances of a schema.
  trevni_tojson  Dumps a Trevni file as JSON.
```

<br/>

**Listing a command's options**

```shell
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar fromjson
Expected 1 arg: input_file
Option             Description
------             -----------
--codec            Compression codec (default: null)
--level <Integer>  Compression level (only applies to deflate and xz) (default: -1)
--schema           Schema
--schema-file      Schema File
```

<br/>

# 3. Reading and Writing Avro in Java

Add the following build configuration to `pom.xml` (the `org.apache.avro:avro` library itself must also be declared as a regular dependency so the generated classes compile):

```xml
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-maven-plugin</artifactId>
            <version>1.10.1</version>
            <executions>
                <execution>
                    <phase>generate-sources</phase>
                    <goals>
                        <goal>schema</goal>
                    </goals>
                    <configuration>
                        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
                        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
    </plugins>
</build>
```

<br/>

## 3.1 Reading and Writing Avro with Classes Generated by avro-tools

(1) Define the storage format (schema) in user.avsc:

```json
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": "int"},
     {"name": "favorite_color", "type": "string"}
 ]
}
```

(2) avro-tools can generate a Java class from the schema:

```shell
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar \
compile schema /hdatas/user.avsc /hdatas/User.java
```

The last argument is the output directory, so this produces the file /hdatas/User.java/example/avro/User.java.

(3) We then use the generated User class to create an Avro file:

```java
package datamodel;

import example.avro.User;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

public class CreateAvro1 {

    @Test
    public void createAvro() throws IOException {
        // 1. Create User objects; there are three ways to construct them
        User user1 = new User();
        user1.setName("Alyssa");
        user1.setFavoriteNumber(256);
        user1.setFavoriteColor("black");

        User user2 = new User("Ben", 7, "red");

        User user3 = User.newBuilder()
                .setName("Charlie")
                .setFavoriteNumber(12)
                .setFavoriteColor("blue").build();

        /* 2. Serialization: write the data to the user.avro file.
           The DatumWriter interface converts Java objects into the serialized format;
           SpecificDatumWriter is the implementation for generated (specific) classes;
           DataFileWriter performs the actual writing.
        */
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<>(User.class);
        DataFileWriter<User> dataFileWriter = new DataFileWriter<>(userDatumWriter);
        // Create the user.avro file
        dataFileWriter.create(user1.getSchema(), new File("user.avro"));
        // Append records to user.avro
        dataFileWriter.append(user1);
        dataFileWriter.append(user2);
        dataFileWriter.append(user3);
        // Close the writer
        dataFileWriter.close();

        // 3. Deserialization: read the data back from the user.avro file
        File file = new File("user.avro");
        DatumReader<User> userDatumReader = new SpecificDatumReader<>(User.class);
        DataFileReader<User> dataFileReader = new DataFileReader<>(file, userDatumReader);
        User user = null;
        while (dataFileReader.hasNext()) {
            user = dataFileReader.next(user);
            System.out.println(user);
        }
    }
}
```

The code above prints:

```
{"name": "Alyssa", "favorite_number": 256, "favorite_color": "black"}
{"name": "Ben", "favorite_number": 7, "favorite_color": "red"}
{"name": "Charlie", "favorite_number": 12, "favorite_color": "blue"}
```

<br/>

## 3.2 Reading and Writing Avro with Custom Code

Here we work with the Avro format without the classes generated by avro-tools, using the generic API instead.

(1) Define the storage format (schema) in user.avsc:

```json
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": "int"},
     {"name": "favorite_color", "type": "string"}
 ]
}
```

(2) Java code:

```java
package datamodel;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

public class CreateAvro2 {

    @Test
    public void createAvro() throws IOException {
        // 1. Parse the schema from user.avsc
        Schema schema = new Schema.Parser().parse(new File("user.avsc"));

        // 2. Create the records
        GenericRecord user1 = new GenericData.Record(schema);
        user1.put("name", "Alyssa");
        user1.put("favorite_number", 256);
        user1.put("favorite_color", "black");

        GenericRecord user2 = new GenericData.Record(schema);
        user2.put("name", "Ben");
        user2.put("favorite_number", 7);
        user2.put("favorite_color", "red");

        GenericRecord user3 = new GenericData.Record(schema);
        user3.put("name", "Charlie");
        user3.put("favorite_number", 12);
        user3.put("favorite_color", "blue");

        /* 3. Serialization: write the data to the user.avro file.
           GenericDatumWriter is the DatumWriter implementation for GenericRecord;
           DataFileWriter performs the actual writing.
        */
        DatumWriter<GenericRecord> userDatumWriter = new GenericDatumWriter<>(schema);
        DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(userDatumWriter);
        // Create the user.avro file
        dataFileWriter.create(user1.getSchema(), new File("user.avro"));
        // Append records to user.avro
        dataFileWriter.append(user1);
        dataFileWriter.append(user2);
        dataFileWriter.append(user3);
        // Close the writer
        dataFileWriter.close();

        // 4. Deserialization: read the data back from the user.avro file
        File file = new File("user.avro");
        DatumReader<GenericRecord> userDatumReader = new GenericDatumReader<>(schema);
        DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(file, userDatumReader);
        GenericRecord user = null;
        while (dataFileReader.hasNext()) {
            user = dataFileReader.next(user);
            System.out.println(user);
        }
    }
}
```

The code above prints:

```
{"name": "Alyssa", "favorite_number": 256, "favorite_color": "black"}
{"name": "Ben", "favorite_number": 7, "favorite_color": "red"}
{"name": "Charlie", "favorite_number": 12, "favorite_color": "blue"}
```

<br/>

# 4. Using Avro as the Storage Format in Hive

```sql
-- Option 1
create external table user_avro_ext(
    name string,
    favorite_number int,
    favorite_color string
)
stored as avro;

-- Option 2
create table customers
row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
stored as inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
tblproperties ('avro.schema.literal'='{
    "name": "customer",
    "type": "record",
    "fields": [
        {"name":"firstName", "type":"string"},
        {"name":"lastName", "type":"string"},
        {"name":"age", "type":"int"},
        {"name":"salary", "type":"double"},
        {"name":"department", "type":"string"},
        {"name":"title", "type":"string"},
        {"name": "address", "type": "string"}]}');
```
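As a minimal usage sketch (not from the original article; it assumes a Hive session on the same host and reuses the example paths from section 2), an Avro container file such as the user.avro generated earlier can be loaded straight into the first table, since the table's column names and types match the schema embedded in the file:

```sql
-- Assumes the user_avro_ext table above and the user.avro file
-- produced in section 2 (example path on the Hive client machine)
LOAD DATA LOCAL INPATH '/hdatas/user.avro' INTO TABLE user_avro_ext;

-- The Avro records are now queryable as ordinary rows
SELECT name, favorite_number, favorite_color FROM user_avro_ext;
```

Because the AvroSerDe resolves columns by name against the schema stored in the file header, files written by avro-tools or by the Java code above can be queried without any conversion step.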