Impala: A Modern, Open-Source SQL Engine for Hadoop (PDF download)


Date: 2020-04-22 10:36  Source: http://www.java1234.com  Author: 小锋
Main content:

Cloudera Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data
processing environment. Impala provides low latency and
high concurrency for BI/analytic read-mostly queries on
Hadoop, not delivered by batch frameworks such as Apache
Hive. This paper presents Impala from a user’s perspective,
gives an overview of its architecture and main components
and briefly demonstrates its superior performance compared
against other popular SQL-on-Hadoop systems.
1. INTRODUCTION
Impala is an open-source¹, fully-integrated, state-of-the-art
MPP SQL query engine designed specifically to leverage
the flexibility and scalability of Hadoop. Impala’s goal is
to combine the familiar SQL support and multi-user performance of a traditional analytic database with the scalability
and flexibility of Apache Hadoop and the production-grade
security and management extensions of Cloudera Enterprise.
Impala’s beta release was in October 2012 and it became generally available (GA) in
May 2013. The most recent version, Impala 2.0, was released
in October 2014. Impala’s ecosystem momentum continues
to accelerate, with nearly one million downloads since its
GA.
Unlike other systems (often forks of Postgres), Impala is a
brand-new engine, written from the ground up in C++ and
Java. It maintains Hadoop’s flexibility by utilizing standard
components (HDFS, HBase, Metastore, YARN, Sentry) and
is able to read the majority of the widely-used file formats
(e.g. Parquet, Avro, RCFile). To reduce latency, such as
that incurred from utilizing MapReduce or by reading data
remotely, Impala implements a distributed architecture based
on daemon processes that are responsible for all aspects of
query execution and that run on the same machines as the
rest of the Hadoop infrastructure. The result is performance
that is on par with or exceeds that of commercial MPP analytic
DBMSs, depending on the particular workload.
¹ https://github.com/cloudera/impala
This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative
works, provided that you attribute the original work to the author(s) and
CIDR 2015.
7th Biennial Conference on Innovative Data Systems Research (CIDR’15)
January 4-7, 2015, Asilomar, California, USA.
This paper discusses the services Impala provides to the
user and then presents an overview of its architecture and
main components. The highest performance that is achievable today requires using HDFS as the underlying storage
manager, and therefore that is the focus of this paper; when
there are notable differences in terms of how certain technical
aspects are handled in conjunction with HBase, we note that
in the text without going into detail.
Impala is the highest performing SQL-on-Hadoop system,
especially under multi-user workloads. As Section 7 shows,
for single-user queries, Impala is up to 13x faster than alternatives, and 6.7x faster on average. For multi-user queries,
the gap widens: Impala is up to 27.4x faster than alternatives,
and 18x faster on average – or nearly three times faster on
average for multi-user queries than for single-user ones.
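The "nearly three times" figure is the ratio of the two average speedups quoted above, not a separate measurement. A minimal sketch of that arithmetic, using only the numbers stated in the text (the variable names are illustrative):

```python
# Average speedups over alternatives, as quoted in the text above.
single_user_avg = 6.7   # single-user queries, average
multi_user_avg = 18.0   # multi-user queries, average

# How much larger the average advantage is under multi-user load.
ratio = multi_user_avg / single_user_avg
print(f"multi-user / single-user average speedup: {ratio:.2f}x")
```

The ratio comes out at roughly 2.7, which the text rounds to "nearly three times".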
The remainder of this paper is structured as follows: the
next section gives an overview of Impala from the user’s
perspective and points out how it differs from a traditional
RDBMS. Section 3 presents the overall architecture of the
system. Section 4 presents the frontend component, which
includes a cost-based distributed query optimizer; Section 5
presents the backend component, which is responsible for the
query execution and employs runtime code generation; and
Section 6 presents the resource/workload management component. Section 7 briefly evaluates the performance of Impala. Section 8 discusses the roadmap ahead and Section 9
concludes.
2. USER VIEW OF IMPALA
Impala is a query engine which is integrated into the
Hadoop environment and utilizes a number of standard
Hadoop components (Metastore, HDFS, HBase, YARN, Sentry) in order to deliver an RDBMS-like experience. However,
there are some important differences that will be brought up
in the remainder of this section.
Impala was specifically targeted for integration with standard business intelligence environments, and to that end
supports most relevant industry standards: clients can connect via ODBC or JDBC; authentication is accomplished
with Kerberos or LDAP; authorization follows the standard
SQL roles and privileges². In order to query HDFS-resident
² This is provided by another standard Hadoop component
called Sentry [4], which also makes role-based authorization available to Hive, and other co

 
