Apache Pig Join运算符

Apache Pig

Apache Pig教程 Apache Pig 概述 Apache Pig 架构 Apache Pig 安装 Apache Pig 执行 Apache Pig Grunt Shell Pig Latin 基础 Apache Pig 加载数据 Apache Pig 存储数据 Apache Pig Diagnostic运算符 Apache Pig Describe运算符 Apache Pig Explain运算符 Apache Pig illustrate运算符 Apache Pig Group运算符 Apache Pig Cogroup运算符 Apache Pig Join运算符 Apache Pig Cross运算符 Apache Pig Union运算符 Apache Pig Split运算符 Apache Pig Filter运算符 Apache Pig Distinct运算符 Apache Pig Foreach运算符 Apache Pig Order By运算符 Apache Pig Limit运算符 Apache Pig Eval函数 Apache Pig 加载和存储函数 Apache Pig 包和元组函数 Apache Pig 字符串函数 Apache Pig 日期时间函数 Apache Pig 数学函数 Apache Pig 用户定义函数（UDF） Apache Pig 运行脚本

Apache Pig Join运算符

JOIN 运算符用于组合来自两个或多个关系的记录。在执行连接操作时，我们从每个关系中声明一个（或一组）元组作为key。当这些key匹配时，两个特定的元组匹配，否则记录将被丢弃。连接可以是以下类型：

Self-join
Inner-join
Outer-join − left join, right join, and full join

本章介绍了如何在Pig Latin中使用join运算符的示例。假设在HDFS的 /pig_data/ 目录中有两个文件，即customers.txt 和 orders.txt ，如下所示。

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00 
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

我们将这两个文件与 customers 和 orders 关系一起加载到Pig中，如下所示。

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
  
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

现在让我们对这两个关系执行各种连接操作。

Self-join（自连接）

Self-join 用于将表与其自身连接，就像表是两个关系一样，临时重命名至少一个关系。通常，在Apache Pig中，为了执行self-join，我们将在不同的别名（名称）下多次加载相同的数据。那么，将文件 customers.txt 的内容加载为两个表，如下所示。

grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
  
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

语法

下面给出使用 JOIN 运算符执行self-join操作的语法。

grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;

例

通过如图所示加入两个关系 customers1 和 customers2 ，对关系 customers 执行self-join 操作。

grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

验证

使用 DUMP 运算符验证关系 customers3 ，如下所示。

grunt> Dump customers3;

输出

将产生以下输出，显示关系 customers 的内容。

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join（内部连接）

Inner Join使用较为频繁；它也被称为等值连接。当两个表中都存在匹配时，内部连接将返回行。基于连接谓词（join-predicate），通过组合两个关系（例如A和B）的列值来创建新关系。查询将A的每一行与B的每一行进行比较，以查找满足连接谓词的所有行对。当连接谓词被满足时，A和B的每个匹配的行对的列值被组合成结果行。

语法

以下是使用 JOIN 运算符执行inner join操作的语法。

grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;

例

让我们对customers和orders执行inner join操作，如下所示。

grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;

验证

使用 DUMP 运算符验证 coustomer_orders 关系，如下所示。

grunt> Dump coustomer_orders;

输出

将获得以下输出，是名为 coustomer_orders 的关系的内容。

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

注意：

Outer Join：与inner join不同，outer join返回至少一个关系中的所有行。outer join操作以三种方式执行：

Left outer join
Right outer join
Full outer join

Left Outer Join（左外连接）

left outer join操作返回左表中的所有行，即使右边的关系中没有匹配项。

语法

下面给出使用 JOIN 运算符执行left outer join操作的语法。

grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;

例

让我们对customers和orders的两个关系执行left outer join操作，如下所示。

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

验证

使用 DUMP 运算符验证关系 outer_left ，如下所示。

grunt> Dump outer_left;

输出

它将产生以下输出，显示关系 outer_left 的内容。

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Right Outer Join（右外连接）

right outer join操作将返回右表中的所有行，即使左表中没有匹配项。

语法

下面给出使用 JOIN 运算符执行right outer join操作的语法。

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

例

让我们对customers和orders执行right outer join操作，如下所示。

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

验证

使用 DUMP 运算符验证关系 outer_right ，如下所示。

grunt> Dump outer_right

输出

它将产生以下输出，显示关系 outer_right 的内容。

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Full Outer Join（全外连接）

当一个关系中存在匹配时，full outer join操作将返回行。

语法

下面给出使用 JOIN 运算符执行full outer join的语法。

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

例

让我们对customers和orders执行full outer join操作，如下所示。

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

验证

使用 DUMP 运算符验证关系 outer_full ，如下所示。

grun> Dump outer_full;

输出

它将产生以下输出，显示关系 outer_full 的内容。

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

使用多个Key

我们可以使用多个key执行JOIN操作。

语法

下面是如何使用多个key对两个表执行JOIN操作。

grunt> Relation3_name = JOIN Relation2_name BY (key1, key2), Relation3_name BY (key1, key2);

假设在HDFS的 /pig_data/ 目录中有两个文件，即 employee.txt 和 employee_contact.txt ，如下所示。

employee.txt

001,Rajiv,Reddy,21,programmer,003
002,siddarth,Battacharya,22,programmer,003
003,Rajesh,Khanna,22,programmer,003
004,Preethi,Agarwal,21,programmer,003
005,Trupthi,Mohanthy,23,programmer,003
006,Archana,Mishra,23,programmer,003
007,Komal,Nayak,24,teamlead,002
008,Bharathi,Nambiayar,24,manager,001

employee_contact.txt

001,9848022337,Rajiv@gmail.com,Hyderabad,003
002,9848022338,siddarth@gmail.com,Kolkata,003
003,9848022339,Rajesh@gmail.com,Delhi,003
004,9848022330,Preethi@gmail.com,Pune,003
005,9848022336,Trupthi@gmail.com,Bhuwaneshwar,003
006,9848022335,Archana@gmail.com,Chennai,003
007,9848022334,Komal@gmail.com,trivendram,002
008,9848022333,Bharathi@gmail.com,Chennai,001

将这两个文件加载到Pig中，通过关系 employee 和 employee_contact ，如下所示。

grunt> employee = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int);
  
grunt> employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',') 
   as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);

现在，让我们使用 JOIN 运算符连接这两个关系的内容，如下所示。

grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);

验证

使用 DUMP 运算符验证关系 emp ，如下所示。

grunt> Dump emp;

输出

它将产生以下输出，显示名为 emp 的关系的内容，如下所示。

(1,Rajiv,Reddy,21,programmer,113,1,9848022337,Rajiv@gmail.com,Hyderabad,113)
(2,siddarth,Battacharya,22,programmer,113,2,9848022338,siddarth@gmail.com,Kolka ta,113)  
(3,Rajesh,Khanna,22,programmer,113,3,9848022339,Rajesh@gmail.com,Delhi,113)  
(4,Preethi,Agarwal,21,programmer,113,4,9848022330,Preethi@gmail.com,Pune,113)  
(5,Trupthi,Mohanthy,23,programmer,113,5,9848022336,Trupthi@gmail.com,Bhuwaneshw ar,113)  
(6,Archana,Mishra,23,programmer,113,6,9848022335,Archana@gmail.com,Chennai,113)  
(7,Komal,Nayak,24,teamlead,112,7,9848022334,Komal@gmail.com,trivendram,112)  
(8,Bharathi,Nambiayar,24,manager,111,8,9848022333,Bharathi@gmail.com,Chennai,111)

上一篇:Apache Pig Cogroup运算符

下一篇:Apache Pig Cross运算符

我要发贴

Apache Pig