Apache Kafka PDF Download
Main content:
		Kafka as a Messaging System 
	
		How does Kafka's notion of streams compare to a traditional enterprise messaging system? 
	
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber—once one process reads the data it's gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.
	
The consumer group concept in Kafka generalizes these two concepts. As with a queue, the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.

The advantage of Kafka's model is that every topic has both these properties—it can scale processing and is also multi-subscriber; there is no need to choose one or the other.
	
		Kafka has stronger ordering guarantees than a traditional messaging system, too. 
	
A traditional queue retains records in order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.
	
Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.
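As a concrete sketch of how groups and partitions interact with the console tools (the broker address localhost:9092 and a multi-partition topic named test are assumptions here, not something established above): consumers started with the same --group value divide the topic's partitions between themselves, while a consumer in a different group independently receives every record.

# terminals 1 and 2: two members of the same group share the topic's partitions
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --group group-A
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --group group-A
# terminal 3: a different group independently sees the full stream again
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --group group-B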
	
		Kafka as a Storage System 
	
Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.
	
Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
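As an illustration of waiting on acknowledgement, the console producer can pass the standard acks setting through to the underlying producer; this is a sketch assuming the single-node quick start setup described later (broker on localhost:9092, topic test):

> bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test --producer-property acks=all

With acks=all, a record is only considered written once the leader has confirmation from all in-sync replicas, which is the behaviour the paragraph above describes.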
	
The disk structures Kafka uses scale well—Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.
	
As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
	
For details about Kafka's commit log storage and replication design, please read this page.
	
		Kafka for Stream Processing 
	
It isn't enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.
	
In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
	
For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.
	
It is possible to do simple processing directly using the producer and consumer APIs. However, for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.
	
	
This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.
	
The Streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.
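A quick way to see the Streams API at work without writing code is the word-count demo that ships with Kafka; this sketch assumes the quick start broker below and the input/output topics used by the official Streams quickstart (streams-plaintext-input and streams-wordcount-output), which must be created first:

> bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

The demo continually consumes lines from the input topic, keeps a running count per word as its state, and writes the updated counts to the output topic.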
	
		Putting the Pieces Together 
	
This combination of messaging, storage, and stream processing may seem unusual but it is essential to Kafka's role as a streaming platform.
	
A distributed file system like HDFS allows storing static files for batch processing. Effectively a system like this allows storing and processing historical data from the past.
	
A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.
	
Kafka combines both of these capabilities, and the combination is critical both for Kafka usage as a platform for streaming applications as well as for streaming data pipelines.
	
By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is, a single application can process historical, stored data but rather than ending when it reaches the last record it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.
	
Likewise for streaming data pipelines the combination of subscription to real-time events makes it possible to use Kafka for very low-latency pipelines; but the ability to store data reliably makes it possible to use it for critical data where the delivery of data must be guaranteed, or for integration with offline systems that load data only periodically or may go down for extended periods of time for maintenance. The stream processing facilities make it possible to transform data as it arrives.
	
		For more information on the guarantees, APIs, and capabilities Kafka provides see the rest of the documentation. 
	
		1.2 Use Cases 
	
Here is a description of a few of the popular use cases for Apache Kafka®. For an overview of a number of these areas in action, see this blog post.
	
		Messaging 
	
Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc.). In comparison to most messaging systems Kafka has better throughput, built-in partitioning, replication, and fault-tolerance which makes it a good solution for large scale message processing applications.
	
In our experience messaging uses are often comparatively low-throughput, but may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides.
	
In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

Website Activity Tracking
	
The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.
	
		Activity tracking is often very high volume as many activity messages are generated for each user page view. 
	
		Metrics 
	
Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
	
		Log Aggregation 
	
	
Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
	
		Stream Processing 
	
Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing. For example, a processing pipeline for recommending news articles might crawl article content from RSS feeds and publish it to an "articles" topic; further processing might normalize or deduplicate this content and publish the cleansed article content to a new topic; a final processing stage might attempt to recommend this content to users. Such processing pipelines create graphs of real-time data flows based on the individual topics. Starting in 0.10.0.0, a light-weight but powerful stream processing library called Kafka Streams is available in Apache Kafka to perform such data processing as described above. Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.

Event Sourcing
	
Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.
	
		Commit Log 
	
Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. The log compaction feature in Kafka helps support this usage. In this usage Kafka is similar to the Apache BookKeeper project.
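As a small sketch of enabling this (the topic name my-changelog is made up for illustration), log compaction is switched on per topic with the cleanup.policy configuration, so that Kafka retains at least the latest record for every key:

> bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 3 --partitions 1 --topic my-changelog --config cleanup.policy=compact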
	
		1.3 Quick Start 
	
This tutorial assumes you are starting fresh and have no existing Kafka or ZooKeeper data. Since Kafka console scripts are different for Unix-based and Windows platforms, on Windows platforms use bin\windows\ instead of bin/, and change the script extension to .bat.
	
		Step 1: Download the code 
	
Download the 2.5.0 release and un-tar it.

> tar -xzf kafka_2.12-2.5.0.tgz
> cd kafka_2.12-2.5.0
	
		Step 2: Start the server 
	
Kafka uses ZooKeeper so you need to first start a ZooKeeper server if you don't already have one. You can use the convenience script packaged with Kafka to get a quick-and-dirty single-node ZooKeeper instance.

> bin/zookeeper-server-start.sh config/zookeeper.properties
[2013-04-22 15:01:37,495] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
...

Now start the Kafka server:

> bin/kafka-server-start.sh config/server.properties
[2013-04-22 15:01:47,028] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
...
	
		Step 3: Create a topic 
	
Let's create a topic named "test" with a single partition and only one replica:

> bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test

We can now see that topic if we run the list topic command:

> bin/kafka-topics.sh --list --bootstrap-server localhost:9092
test
	
	
	
Alternatively, instead of manually creating topics you can also configure your brokers to auto-create topics when a non-existent topic is published to.
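The broker setting involved is auto.create.topics.enable, which defaults to true; a sketch of setting it explicitly in the broker configuration used in this tutorial:

config/server.properties:
    auto.create.topics.enable=true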
	
		Step 4: Send some messages 
	
Kafka comes with a command line client that will take input from a file or from standard input and send it out as messages to the Kafka cluster. By default, each line will be sent as a separate message.
	
Run the producer and then type a few messages into the console to send to the server.

> bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test
This is a message
This is another message
	
		Step 5: Start a consumer 
	
Kafka also has a command line consumer that will dump out messages to standard output.

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is a message
This is another message
	
If you have each of the above commands running in a different terminal then you should now be able to type messages into the producer terminal and see them appear in the consumer terminal.
	
All of the command line tools have additional options; running the command with no arguments will display usage information documenting them in more detail.
	
		Step 6: Setting up a multi-broker cluster 
	
So far we have been running against a single broker, but that's no fun. For Kafka, a single broker is just a cluster of size one, so nothing much changes other than starting a few more broker instances. But just to get a feel for it, let's expand our cluster to three nodes (still all on our local machine).
	
First we make a config file for each of the brokers (on Windows use the copy command instead):

> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties
	
Now edit these new files and set the following properties:

config/server-1.properties:
    broker.id=1
    listeners=PLAINTEXT://:9093
    log.dirs=/tmp/kafka-logs-1

config/server-2.properties:
    broker.id=2
    listeners=PLAINTEXT://:9094
    log.dirs=/tmp/kafka-logs-2
	
The broker.id property is the unique and permanent name of each node in the cluster. We have to override the port and log directory only because we are running these all on the same machine and we want to keep the brokers from all trying to register on the same port or overwrite each other's data.
	
We already have ZooKeeper and our single node started, so we just need to start the two new nodes:

> bin/kafka-server-start.sh config/server-1.properties &
...
> bin/kafka-server-start.sh config/server-2.properties &
...
	
Now create a new topic with a replication factor of three:

> bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 3 --partitions 1 --topic my-replicated-topic
	
Okay but now that we have a cluster how can we know which broker is doing what? To see that run the "describe topics" command:
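The command box that belongs here is missing from this copy; per the official quickstart it is the --describe form of the topics tool, run against the replicated topic created above (assumed to be named my-replicated-topic), with output along these lines (the actual node ids depend on your run):

> bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic my-replicated-topic
Topic: my-replicated-topic  PartitionCount: 1  ReplicationFactor: 3  Configs:
    Topic: my-replicated-topic  Partition: 0  Leader: 1  Replicas: 1,2,0  Isr: 1,2,0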
	
	
	
Here is an explanation of the output. The first line gives a summary of all the partitions; each additional line gives information about one partition. Since we have only one partition for this topic there is only one line.
	
		"leader" is the node responsible for all reads and writes for the given partition. Each node will be the leader for a randomly s 
	
		partitions. 
	
		"replicas" is the list of nodes that replicate the log for this partition regardless of whether they are the leader or even if they a 
	
		"isr" is the set of "in-sync" replicas. This is the subset of the replicas list that is currently alive and caught-up to the leader. 
	
		Note that in my example node 1 is the leader for the only partition of the topic. 
	
		We can run the same command on the original topic we created to see where it is: 
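The corresponding command is missing here as well; presumably it is the same --describe call pointed at the original single-replica topic:

> bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic test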
	
So there is no surprise there—the original topic has no replicas and is on server 0, the only server in our cluster when we created it.
	
		Let's publish a few messages to our new topic: 
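The producer invocation was cut off in this copy; following the pattern of Step 4 and assuming the new topic is named my-replicated-topic, with a couple of illustrative messages:

> bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic my-replicated-topic
my test message 1
my test message 2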
	
		Now let's consume these messages: 
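Likewise the consumer command, reading the same topic from the beginning:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic
my test message 1
my test message 2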
	
		Now let's test out fault-tolerance. Broker 1 was acting as the leader so let's kill it: 
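The kill step itself is not in this copy; on a Unix-like system it amounts to finding the process that was started with config/server-1.properties and killing it (the pid shown is illustrative):

> ps aux | grep server-1.properties
7564 ttys002    0:15.91 /usr/bin/java ... config/server-1.properties
> kill -9 7564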