Collecting Files into HDFS · Big Data

[TOC]

# Analysis

Collection requirement: a business system writes logs with log4j, and the log file keeps growing. The data appended to the log file must be collected into HDFS in real time.

![](https://box.kancloud.cn/d678f7412c66a41023bc1453cfdc6669_666x263.png)

Based on this requirement, first define the following three elements:

* Source: watches the file for newly appended content, using the exec source with `tail -F file`
* Sink: the HDFS file system, using the hdfs sink
* Channel: passes events between the source and the sink; either a file channel or a memory channel can be used

# Configuration file

~~~
# Name the components of this agent
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure tail -F source1
# The source type is exec: it runs a command and reads the command's output
agent1.sources.source1.type = exec
# tail -F follows the content appended to this file
agent1.sources.source1.command = tail -F /root/hadoop2/logs/access_log

# Configure host for source
# Use two interceptors, i1 and i2
agent1.sources.source1.interceptors = i1 i2
# i1 is a host interceptor
agent1.sources.source1.interceptors.i1.type = host
# Store the host value in the event header named "hostname"
agent1.sources.source1.interceptors.i1.hostHeader = hostname
# true stores the IP address; false stores the host name instead
agent1.sources.source1.interceptors.i1.useIP = true
# i2 adds a timestamp header, which the time escapes in hdfs.path rely on
agent1.sources.source1.interceptors.i2.type = timestamp

# Describe sink1
agent1.sinks.sink1.type = hdfs
# Target HDFS path; %{hostname} is the header set by i1, the rest are time escapes
agent1.sinks.sink1.hdfs.path = hdfs://master:9000/file/%{hostname}/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
# Roll to a new file at 10240 bytes, 1000 events, or 10 seconds, whichever comes first
agent1.sinks.sink1.hdfs.rollSize = 10240
agent1.sinks.sink1.hdfs.rollCount = 1000
agent1.sinks.sink1.hdfs.rollInterval = 10
# Round the timestamp down to the nearest 10 minutes when building the path
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute

# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive = 120
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
~~~

# Test run

~~~
flume-ng agent -c conf -f fhd.conf -n agent1 -Dflume.root.logger=INFO,console
~~~

Replace `fhd.conf` with the configuration file you wrote; if it is in a different directory, include the path.
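`-n` is the name of the agent defined above (`agent1`). After startup, the agent's log output is printed to the console. Whenever new content is appended to `/root/hadoop2/logs/access_log`, it is collected and uploaded to HDFS.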
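
To check the pipeline end to end, it helps to append a few lines to the monitored file while the agent is running. The helper below is a minimal sketch of that: `append_test_lines` is a hypothetical name introduced here for illustration, and the text of the test lines is arbitrary.

```shell
# append_test_lines FILE [COUNT]
# Append COUNT (default 5) numbered test lines to FILE, one echo per line,
# so that a running `tail -F FILE` sees each one as newly appended content.
# Hypothetical helper for illustration; not part of Flume.
append_test_lines() {
    file="$1"
    count="${2:-5}"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "flume test line $i" >> "$file"
        i=$((i + 1))
    done
}
```

Assuming the configuration above is in use, calling `append_test_lines /root/hadoop2/logs/access_log` and then running `hdfs dfs -ls -R /file` should show new `access_log` files under the host and time directories. With `rollInterval = 10`, a file may need about ten seconds before it is closed and rolled.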