Impala: A Modern, Open-Source SQL Engine for Hadoop (PDF download)


Download compiled by this site:

Link: https://pan.baidu.com/s/1awfTP10q66UBKlMGop0Vpg

Extraction code: z6bd

Main content:

Cloudera Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data

processing environment. Impala provides low latency and

high concurrency for BI/analytic read-mostly queries on

Hadoop, not delivered by batch frameworks such as Apache

Hive. This paper presents Impala from a user’s perspective,

gives an overview of its architecture and main components

and briefly demonstrates its superior performance compared

against other popular SQL-on-Hadoop systems.

1. INTRODUCTION

Impala is an open-source¹, fully-integrated, state-of-the-art MPP SQL query engine designed specifically to leverage

the flexibility and scalability of Hadoop. Impala’s goal is

to combine the familiar SQL support and multi-user performance of a traditional analytic database with the scalability

and flexibility of Apache Hadoop and the production-grade

security and management extensions of Cloudera Enterprise.

Impala’s beta release was in October 2012 and it reached general availability (GA) in

May 2013. The most recent version, Impala 2.0, was released

in October 2014. Impala’s ecosystem momentum continues

to accelerate, with nearly one million downloads since its

GA.

Unlike other systems (often forks of Postgres), Impala is a

brand-new engine, written from the ground up in C++ and

Java. It maintains Hadoop’s flexibility by utilizing standard

components (HDFS, HBase, Metastore, YARN, Sentry) and

is able to read the majority of the widely-used file formats

(e.g. Parquet, Avro, RCFile). To reduce latency, such as

that incurred from utilizing MapReduce or by reading data

remotely, Impala implements a distributed architecture based

on daemon processes that are responsible for all aspects of

query execution and that run on the same machines as the

rest of the Hadoop infrastructure. The result is performance

that is on par with or exceeds that of commercial MPP analytic

DBMSs, depending on the particular workload.

¹ https://github.com/cloudera/impala

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative

works, provided that you attribute the original work to the author(s) and

CIDR 2015.

7th Biennial Conference on Innovative Data Systems Research (CIDR’15)

January 4-7, 2015, Asilomar, California, USA.

This paper discusses the services Impala provides to the

user and then presents an overview of its architecture and

main components. The highest performance that is achievable today requires using HDFS as the underlying storage

manager, and therefore that is the focus of this paper; when

there are notable differences in terms of how certain technical

aspects are handled in conjunction with HBase, we note that

in the text without going into detail.

Impala is the highest performing SQL-on-Hadoop system,

especially under multi-user workloads. As Section 7 shows,

for single-user queries, Impala is up to 13x faster than alternatives, and 6.7x faster on average. For multi-user queries,

the gap widens: Impala is up to 27.4x faster than alternatives,

and 18x faster on average – or nearly three times faster on

average for multi-user queries than for single-user ones.
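
The "nearly three times" figure appears to be the ratio of the two average speedups quoted above. A minimal sketch of that arithmetic, using only the numbers stated in this paragraph (the variable and function names are illustrative, not from the paper):

```python
# Speedup figures quoted in the text (Impala relative to the alternatives).
SINGLE_USER_AVG = 6.7    # average speedup, single-user queries
SINGLE_USER_MAX = 13.0   # best-case speedup, single-user queries
MULTI_USER_AVG = 18.0    # average speedup, multi-user queries
MULTI_USER_MAX = 27.4    # best-case speedup, multi-user queries


def relative_gain(multi: float, single: float) -> float:
    """Return how many times larger the multi-user speedup is."""
    return multi / single


avg_gain = relative_gain(MULTI_USER_AVG, SINGLE_USER_AVG)
max_gain = relative_gain(MULTI_USER_MAX, SINGLE_USER_MAX)

print(f"average speedups differ by a factor of {avg_gain:.2f}")    # 2.69
print(f"best-case speedups differ by a factor of {max_gain:.2f}")  # 2.11
```

On the averages the advantage grows by a factor of about 2.7 when moving from single-user to multi-user workloads, which matches the "nearly three times" wording; on the best-case figures the factor is closer to 2.1.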

The remainder of this paper is structured as follows: the

next section gives an overview of Impala from the user’s

perspective and points out how it differs from a traditional

RDBMS. Section 3 presents the overall architecture of the

system. Section 4 presents the frontend component, which

includes a cost-based distributed query optimizer, Section 5

presents the backend component, which is responsible for the

query execution and employs runtime code generation, and

Section 6 presents the resource/workload management component. Section 7 briefly evaluates the performance of Impala. Section 8 discusses the roadmap ahead and Section 9

concludes.

2. USER VIEW OF IMPALA

Impala is a query engine which is integrated into the

Hadoop environment and utilizes a number of standard

Hadoop components (Metastore, HDFS, HBase, YARN, Sentry) in order to deliver an RDBMS-like experience. However,

there are some important differences that will be brought up

in the remainder of this section.

Impala was specifically targeted for integration with standard business intelligence environments, and to that end

supports most relevant industry standards: clients can connect via ODBC or JDBC; authentication is accomplished

with Kerberos or LDAP; authorization follows the standard

SQL roles and privileges². In order to query HDFS-resident

² This is provided by another standard Hadoop component

called Sentry [4], which also makes role-based authorization available to Hive, and other co
