Pig Installation and Testing


Introduction to Apache Pig
Pig is an Apache Foundation project that provides a platform for analyzing large data sets. Programmers who have used MapReduce know that complex data sets often require chaining several MapReduce jobs to get a result. Pig was created to address exactly this problem, and it consists of two parts:
* Pig Latin, a textual language for describing data flows;
* An execution environment that runs Pig Latin programs: a compiler that turns them into MapReduce jobs.
Pig has three notable characteristics:
(1) Easy to program. A Pig Latin program is a sequence of "operations" or "transformations" that, in effect, turn a MapReduce program into a data flow, making it very easy to express both simple analyses and ones that need heavy parallelism; a few lines of Pig Latin at the console can process a TB-scale data set.
(2) Automatic optimization. The system optimizes Pig Latin code automatically, so programmers can skip manual tuning, stop worrying about efficiency, and spend their time on the semantics of the analysis.
(3) Good extensibility. Programmers can write user-defined functions to fit their needs; the load, store, filter, and join steps can all be customized.

Installing Pig
Download pig-0.13.0.tar.gz and upload it to the cluster.

Extract the archive:

tar -xzf pig-0.13.0.tar.gz

Configure Pig's environment variables:

vi /home/gznc/.bash_profile 

Add the following lines:

export PIG_HOME=/home/gznc/pig-0.13.0
export PATH=$PIG_HOME/bin:$PIG_HOME/conf:$PATH
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop


Apply the configuration:

source /home/gznc/.bash_profile 

Verify the setup with the command pig -help; if the following output appears, the configuration succeeded:

[gznc@master ~]$ pig -help

Apache Pig version 0.13.0 (r1606446) 
compiled Jun 29 2014, 02:27:58

USAGE: Pig [options] [-] : Run interactively in grunt shell.
       Pig [options] -e[xecute] cmd [cmd ...] : Run cmd(s).
       Pig [options] [-f[ile]] file : Run cmds found in file.
  options include:
    -4, -log4jconf - Log4j configuration file, overrides log conf
    -b, -brief - Brief logging (no timestamps)
    -c, -check - Syntax check
    -d, -debug - Debug level, INFO is default
    -e, -execute - Commands to execute (within quotes)
    -f, -file - Path to the script to execute
    -g, -embedded - ScriptEngine classname or keyword for the ScriptEngine
    -h, -help - Display this message. You can specify topic to get help for that topic.
        properties is the only topic currently supported: -h properties.
    -i, -version - Display version information
    -l, -logfile - Path to client side log file; default is current working directory.
    -m, -param_file - Path to the parameter file
    -p, -param - Key value pair of the form param=val
    -r, -dryrun - Produces script with substituted parameters. Script is not executed.
    -t, -optimizer_off - Turn optimizations off. The following values are supported:
            SplitFilter - Split filter conditions
            PushUpFilter - Filter as early as possible
            MergeFilter - Merge filter conditions
            PushDownForeachFlatten - Join or explode as late as possible
            LimitOptimizer - Limit as early as possible
            ColumnMapKeyPrune - Remove unused data
            AddForEach - Add ForEach to remove unneeded columns
            MergeForEach - Merge adjacent ForEach
            GroupByConstParallelSetter - Force parallel 1 for "group all" statement
            All - Disable all optimizations
        All optimizations listed here are enabled by default. Optimization values are case insensitive.
    -v, -verbose - Print all error messages to screen
    -w, -warning - Turn warning logging on; also turns warning aggregation off
    -x, -exectype - Set execution mode: local|mapreduce, default is mapreduce.
    -F, -stop_on_failure - Aborts execution on the first failed job; default is off
    -M, -no_multiquery - Turn multiquery optimization off; default is on
    -N, -no_fetch - Turn fetch optimization off; default is on
    -P, -propertyFile - Path to property file
    -printCmdDebug - Overrides anything else and prints the actual command used to run Pig, including
                     any environment variables that are set by the pig command.

Pig Run Modes
Pig has two run modes: local mode and MapReduce mode. Local mode can only access the local file system; it is typically used for small data sets and does not require a Hadoop cluster. MapReduce mode runs on a Hadoop cluster, where Pig compiles Pig Latin programs into MapReduce jobs for execution.

Local mode: pig -x local

 [gznc@master ~]$ pig -x local
16/11/09 13:43:47 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
16/11/09 13:43:47 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2016-11-09 13:43:47,413 [main] INFO  org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2016-11-09 13:43:47,413 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/gznc/pig_1478670227411.log
2016-11-09 13:43:47,486 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/gznc/.pigbootup not found
2016-11-09 13:43:48,327 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-11-09 13:43:48,330 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2016-11-09 13:43:48,358 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/gznc/hadoop-2.5.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/gznc/hbase-0.98.7-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016-11-09 13:43:49,566 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-11-09 13:43:49,962 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-11-09 13:43:49,996 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

Example: given a website access log, use Pig to count the number of visits per IP. The data can be downloaded from http://download.csdn.net/detail/qq_33624294/9678198. The code is as follows; the path after LOAD points to data.txt on the local desktop.

A = LOAD '/home/gznc/Desktop/data.txt' USING PigStorage(' ')
    AS (ip:chararray);
B = FOREACH (GROUP A BY ip)
    GENERATE group AS ip, COUNT(A) AS clicks;
DUMP B;

The result is:

(1.207.63.200,9)
(14.29.127.77,32)
(1.204.253.188,18)
(119.0.231.104,33)
(182.118.49.47,4)
(101.199.108.58,4)
(218.201.249.196,9)
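
For intuition, the grouping and counting that this Pig job performs can be sketched in plain Python. The sample log lines below are hypothetical; like PigStorage(' '), the sketch splits each line on spaces and keeps only the first field as the IP:

```python
from collections import Counter

# Hypothetical sample log lines; the first space-separated field is the IP.
log_lines = [
    "1.207.63.200 GET /index.html",
    "14.29.127.77 GET /about.html",
    "1.207.63.200 GET /contact.html",
]

# GROUP A BY ip ... COUNT(A) corresponds to counting occurrences per IP.
clicks = Counter(line.split(" ")[0] for line in log_lines)
print(sorted(clicks.items()))  # [('1.207.63.200', 2), ('14.29.127.77', 1)]
```

The difference, of course, is that Pig compiles the same logic into a MapReduce job that scales far beyond what fits in one process's memory.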

To extract the three IPs with the most clicks:

C = ORDER B BY clicks DESC;
D = LIMIT C 3;
DUMP D;

Result:

(119.0.231.104,33)
(14.29.127.77,32)
(1.204.253.188,18)
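
The ORDER ... DESC plus LIMIT 3 combination is the classic "top-N" pattern. A Python sketch of the same step, using hypothetical per-IP counts shaped like relation B above:

```python
from collections import Counter

# Hypothetical per-IP click counts, shaped like relation B above.
clicks = Counter({
    "119.0.231.104": 33,
    "14.29.127.77": 32,
    "1.204.253.188": 18,
    "1.207.63.200": 9,
})

# ORDER B BY clicks DESC; LIMIT C 3; corresponds to taking the top three.
top3 = clicks.most_common(3)
print(top3)  # [('119.0.231.104', 33), ('14.29.127.77', 32), ('1.204.253.188', 18)]
```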

MapReduce mode: pig -x mapreduce

[gznc@master ~]$ pig -x mapreduce
16/11/09 14:02:51 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
16/11/09 14:02:51 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
16/11/09 14:02:51 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2016-11-09 14:02:51,130 [main] INFO  org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2016-11-09 14:02:51,130 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/gznc/pig_1478671371128.log
2016-11-09 14:02:51,187 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/gznc/.pigbootup not found
2016-11-09 14:02:52,659 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2016-11-09 14:02:52,659 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-11-09 14:02:52,662 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master:9000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/gznc/hadoop-2.5.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/gznc/hbase-0.98.7-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016-11-09 14:02:53,812 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-11-09 14:02:55,603 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt> 

Upload the log data to the cluster:

grunt> copyFromLocal /home/gznc/Desktop/data.txt ./

Check that the uploaded file exists on the cluster with the command: hadoop fs -ls


This time the path after LOAD refers to data.txt on the cluster:

[gznc@master ~]$ pig -x mapreduce
grunt> A = LOAD 'data.txt' USING PigStorage(' ')                   
>> AS (ip:chararray);
2016-11-09 14:06:03,529 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt> B = FOREACH (
>> GROUP A BY ip
>> )
>> GENERATE group AS ip,COUNT(A) AS clicks;
grunt> DUMP B;

Result:

(1.207.63.200,9)
(14.29.127.77,32)
(1.204.253.188,18)
(119.0.231.104,33)
(182.118.49.47,4)
(101.199.108.58,4)
(218.201.249.196,9)

To extract the three IPs with the most clicks:

C = ORDER B BY clicks DESC;
D = LIMIT C 3;
DUMP D;

Result:

(119.0.231.104,33)
(14.29.127.77,32)
(1.204.253.188,18)

To write the result to a directory on the cluster, use the STORE command, which takes a relation and a quoted output path. For the example above:

STORE D INTO '/user/gznc/output/ipfile';

To exit Pig, run quit.
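
The output format STORE produces can be sketched in plain Python. The sample tuples and the temporary path below are hypothetical; with the default PigStorage serializer, each tuple becomes one line with tab-separated fields, written to a part file inside the output directory:

```python
import os
import tempfile

# Hypothetical result tuples, shaped like relation D above.
results = [("119.0.231.104", 33), ("14.29.127.77", 32), ("1.204.253.188", 18)]

out_dir = tempfile.mkdtemp()
# Hadoop-style part file name inside the output directory.
out_path = os.path.join(out_dir, "part-r-00000")
with open(out_path, "w") as f:
    for ip, count in results:
        # Default PigStorage delimiter is a tab; one tuple per line.
        f.write(f"{ip}\t{count}\n")

with open(out_path) as f:
    lines = f.read().splitlines()
print(lines)
```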

Pig Latin Editors
For Pig programmers, a smart editor can make the work far easier. PigPen provides an Eclipse plugin that includes a Pig script compiler and an example generator. See http://wiki.apache.org/pig/PigPen for details.
