Using LZO Compression in Hadoop with Splittable Files

Tags: Hadoop

[hadoop@hadoop004 hive-1.1.0-cdh5.7.0]$ which lzop
/bin/lzop

[hadoop@hadoop004 data]$ lzop -v page_views_big.dat

[hadoop@hadoop004 data]$ ls -lah
total 1.4G
drwxrwxr-x  2 hadoop hadoop 4.0K Apr 21 18:29 .
drwx------ 12 hadoop hadoop 4.0K Apr 22 01:14 ..
-rw-rw-r--  1 hadoop hadoop  304 Apr 21 18:29 live.txt
-rw-r--r--  1 root   root   455M Apr 19 12:08 login.log
-rw-rw-r--  1 hadoop hadoop 599M Apr 19 18:08 page_views_big.dat
-rw-rw-r--  1 hadoop hadoop 285M Apr 19 18:08 page_views_big.dat.lzo
-rw-r--r--  1 root   root    19M Apr 18 20:47 page_views.dat
-rw-rw-r--  1 hadoop hadoop   44 Apr 18 19:55 wc.txt
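
A quick sanity check on the numbers in this listing: the 599 MB source file compressed down to 285 MB, roughly a 2.1:1 ratio. A minimal sketch, using the rounded sizes from the `ls -lah` output above:

```shell
# Compression ratio of page_views_big.dat (599 MB) vs. page_views_big.dat.lzo (285 MB)
awk 'BEGIN {
    orig = 599; comp = 285                       # sizes in MB, from ls -lah above
    printf "ratio: %.1f%%\n", comp / orig * 100  # compressed size as a % of original
}'
```

LZO trades compression ratio for speed, so ~48% is typical; gzip or bzip2 would compress this text data further but decompress more slowly.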

[hadoop@hadoop004 maven_repo]$ cd ~/software/

[hadoop@hadoop004 software]$ cd hadoop-lzo/

[hadoop@hadoop004 hadoop-lzo]$ mvn clean package -Dmaven.test.skip=true

[hadoop@hadoop004 target]$ ll
total 456
drwxrwxr-x 2 hadoop hadoop   4096 Apr 19 18:43 antrun
drwxrwxr-x 5 hadoop hadoop   4096 Apr 19 18:43 apidocs
drwxrwxr-x 5 hadoop hadoop   4096 Apr 19 18:43 classes
drwxrwxr-x 3 hadoop hadoop   4096 Apr 19 18:43 generated-sources
-rw-rw-r-- 1 hadoop hadoop 188970 Apr 19 18:43 hadoop-lzo-0.4.21-SNAPSHOT.jar
-rw-rw-r-- 1 hadoop hadoop 184565 Apr 19 18:43 hadoop-lzo-0.4.21-SNAPSHOT-javadoc.jar
-rw-rw-r-- 1 hadoop hadoop  52024 Apr 19 18:43 hadoop-lzo-0.4.21-SNAPSHOT-sources.jar
drwxrwxr-x 2 hadoop hadoop   4096 Apr 19 18:43 javadoc-bundle-options
drwxrwxr-x 2 hadoop hadoop   4096 Apr 19 18:43 maven-archiver
drwxrwxr-x 3 hadoop hadoop   4096 Apr 19 18:43 native
drwxrwxr-x 3 hadoop hadoop   4096 Apr 19 18:43 test-classes

[hadoop@hadoop004 target]$ cp hadoop-lzo-0.4.21-SNAPSHOT.jar ~/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/

[hadoop@hadoop004 common]$ ll
total 5548
-rw-r--r-- 1 hadoop hadoop 3411839 Apr 10 01:41 hadoop-common-2.6.0-cdh5.7.0.jar
-rw-r--r-- 1 hadoop hadoop 1892451 Apr 10 01:41 hadoop-common-2.6.0-cdh5.7.0-tests.jar
-rw-rw-r-- 1 hadoop hadoop  188970 Apr 19 18:47 hadoop-lzo-0.4.21-SNAPSHOT.jar
-rw-r--r-- 1 hadoop hadoop  161018 Apr 10 01:41 hadoop-nfs-2.6.0-cdh5.7.0.jar
drwxr-xr-x 2 hadoop hadoop    4096 Apr 10 01:41 jdiff
drwxr-xr-x 2 hadoop hadoop    4096 Apr 10 01:41 lib
drwxr-xr-x 2 hadoop hadoop    4096 Apr 10 01:41 sources
drwxr-xr-x 2 hadoop hadoop    4096 Apr 10 01:41 templates

[hadoop@hadoop004 hadoop]$ vim core-site.xml

<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,
           org.apache.hadoop.io.compress.DefaultCodec,
           org.apache.hadoop.io.compress.BZip2Codec,
           org.apache.hadoop.io.compress.SnappyCodec,
           com.hadoop.compression.lzo.LzoCodec,
           com.hadoop.compression.lzo.LzopCodec
    </value>
</property>

<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>


hive> create table page_views_lzo(
    > track_times string,
    > url string,
    > session_id string,
    > referer string,
    > ip string,
    > end_user_id string,
    > city_id string
    > ) row format delimited fields terminated by '\t'
    > STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
    > OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
OK
Time taken: 0.199 seconds
hive> load data local inpath '/home/hadoop/data/page_views_big.dat.lzo' overwrite into table page_views_lzo;

Loading data to table default.page_views_lzo
Table default.page_views_lzo stats: [numFiles=1, numRows=0, totalSize=298200895, rawDataSize=0]
OK
Time taken: 4.064 seconds
[hadoop@hadoop004 data]$ hdfs dfs -ls /user/hive/warehouse/page_views_lzo
Found 1 items
-rwxr-xr-x   1 hadoop supergroup  298200895 2019-04-23 14:28 /user/hive/warehouse/page_views_lzo/page_views_big.dat.lzo

[hadoop@hadoop004 data]$ hdfs dfs -du -s -h /user/hive/warehouse/page_views_lzo
284.4 M  284.4 M  /user/hive/warehouse/page_views_lzo
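
The human-readable `du` figure is consistent with the exact byte count from the `-ls` listing above (298200895 bytes):

```shell
# Convert the exact file size in bytes to MiB, matching hdfs dfs -du -h output
awk 'BEGIN { printf "%.1f M\n", 298200895 / 1048576 }'   # 1048576 = 1024 * 1024
```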
hive> select count(1) from page_views_lzo;
Query ID = hadoop_20190423142626_386a65de-1dad-4000-b223-15239ce16743
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1556000359234_0001, Tracking URL = http://hadoop004:8088/proxy/application_1556000359234_0001/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1556000359234_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-04-23 14:33:15,184 Stage-1 map = 0%,  reduce = 0%
2019-04-23 14:33:26,643 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 7.06 sec
2019-04-23 14:33:32,982 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.51 sec
MapReduce Total cumulative CPU time: 8 seconds 510 msec
Ended Job = job_1556000359234_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 8.51 sec   HDFS Read: 298207931 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 510 msec
OK
3300000
Time taken: 28.124 seconds, Fetched: 1 row(s)

The last few lines of the output show:

Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 8.51 sec   HDFS Read: 298207931 HDFS Write: 8 SUCCESS

This query ran with only a single map task, even though page_views_big.dat.lzo is 285 MB. At the default 128 MB block size that file spans at least three HDFS blocks, so we would expect three splits and three map tasks. This confirms that an LZO file without an index is not splittable by default.
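
The expected split count is just ceiling division of the file size by the block size. Assuming the default 128 MB HDFS block size and the exact byte count from the `hdfs dfs -ls` output above:

```shell
# Expected number of input splits for a *splittable* 298200895-byte file
FILE_BYTES=298200895
BLOCK_BYTES=$((128 * 1024 * 1024))                          # 134217728
echo $(( (FILE_BYTES + BLOCK_BYTES - 1) / BLOCK_BYTES ))    # ceiling division
```

So a splittable input should yield 3 map tasks, not the 1 observed here.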

Next, let's make the LZO file splittable.

hive> SET hive.exec.compress.output;
hive.exec.compress.output=false

hive> SET hive.exec.compress.output=true;

hive> SET hive.exec.compress.output;
hive.exec.compress.output=true
hive> SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;

hive> SET mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec

hive> create table page_views_lzo_split
    > STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
    > OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    > as select * from page_views_lzo;

Query ID = hadoop_20190423142626_386a65de-1dad-4000-b223-15239ce16743
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1556000359234_0002, Tracking URL = http://hadoop004:8088/proxy/application_1556000359234_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1556000359234_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-04-23 14:42:08,062 Stage-1 map = 0%,  reduce = 0%
2019-04-23 14:42:18,703 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU 7.42 sec
2019-04-23 14:42:21,813 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU 10.92 sec
2019-04-23 14:42:24,211 Stage-1 map = 81%,  reduce = 0%, Cumulative CPU 14.05 sec
2019-04-23 14:42:26,738 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 16.69 sec
MapReduce Total cumulative CPU time: 16 seconds 690 msec
Ended Job = job_1556000359234_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/.hive-staging_hive_2019-04-23_14-42-01_301_8465660280055053580-1/-ext-10001
Moving data to: hdfs://hadoop004:9000/user/hive/warehouse/page_views_lzo_split
Table default.page_views_lzo_split stats: [numFiles=1, numRows=3300000, totalSize=296148323, rawDataSize=624194769]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 16.69 sec   HDFS Read: 298204253 HDFS Write: 296148419 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 690 msec
OK
Time taken: 27.738 seconds
[hadoop@hadoop004 data]$ hdfs dfs -du -s -h /user/hive/warehouse/page_views_lzo_split
282.4 M  282.4 M  /user/hive/warehouse/page_views_lzo_split

Build the LZO file index

[hadoop@hadoop004 data]$ hadoop jar ~/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/page_views_lzo_split
19/04/23 14:47:58 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
19/04/23 14:47:58 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
19/04/23 14:47:59 INFO lzo.LzoIndexer: LZO Indexing directory /user/hive/warehouse/page_views_lzo_split...
19/04/23 14:47:59 INFO lzo.LzoIndexer:   [INDEX] LZO Indexing file hdfs://hadoop004:9000/user/hive/warehouse/page_views_lzo_split/000000_0.lzo, size 0.28 GB...
19/04/23 14:47:59 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
19/04/23 14:48:00 INFO lzo.LzoIndexer:   Completed LZO Indexing in 0.72 seconds (393.90 MB/s).  Index size is 19.97 KB.

[hadoop@hadoop004 data]$ hdfs dfs -ls /user/hive/warehouse/page_views_lzo_split
Found 2 items
-rwxr-xr-x   1 hadoop supergroup  296148323 2019-04-23 14:42 /user/hive/warehouse/page_views_lzo_split/000000_0.lzo
-rw-r--r--   1 hadoop supergroup      20448 2019-04-23 14:48 /user/hive/warehouse/page_views_lzo_split/000000_0.lzo.index
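
The size of the `.lzo.index` file lines up with the hadoop-lzo index layout, which (assuming the standard format) records one 8-byte offset per compressed LZO block so that readers can seek to block boundaries:

```shell
# 20448-byte index file (size from the hdfs dfs -ls output above), 8 bytes per entry
echo $((20448 / 8))    # number of compressed LZO block offsets recorded in the index
```

Each of these offsets is a valid place to start a split, which is what makes the indexed file splittable.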

hive> select count(1) from page_views_lzo_split;

Query ID = hadoop_20190423142626_386a65de-1dad-4000-b223-15239ce16743
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1556000359234_0003, Tracking URL = http://hadoop004:8088/proxy/application_1556000359234_0003/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1556000359234_0003
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2019-04-23 14:49:57,100 Stage-1 map = 0%,  reduce = 0%
2019-04-23 14:50:11,166 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 2.27 sec
2019-04-23 14:50:12,201 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 6.27 sec
2019-04-23 14:50:14,285 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.41 sec
2019-04-23 14:50:19,470 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 12.18 sec
MapReduce Total cumulative CPU time: 12 seconds 180 msec
Ended Job = job_1556000359234_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 12.18 sec   HDFS Read: 296399059 HDFS Write: 58 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 180 msec
OK
3300000
Time taken: 29.314 seconds, Fetched: 1 row(s)

From the output above we can see the map task count is now 3, which proves that an indexed LZO file does support splitting.

Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/xiaoxiongaa0/article/details/89472629
