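Stream Processing Architecture Design with Kafka and Samza: Building Highly Maintainable Real-Time Data Pipelines Following the Unix Philosophy (PDF download)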

Reposted from: http://java.python222.com/article/2050

Main content:

1.1 Implementing Large-Scale Personalized Services
In a large-scale service with many features, the maintainability and the operational robustness of an implementation
are of paramount importance. The system should have the following properties:
System scalability: Supporting an online service with hundreds of millions of registered users, handling millions
of requests per second.
Organizational scalability: Allowing hundreds or even thousands of software engineers to work on the system
without excessive coordination overhead.
Operational robustness: If one part of the system is slow or unavailable, the rest of the system should continue
working normally as much as possible.
Large-scale personalized services have been successfully implemented as batch jobs [30], for example using
MapReduce [6]. Performing a recommendation system’s computations in offline batch jobs decouples them from
the online systems that serve user requests, making them easier to maintain and less operationally sensitive.
The main downside of batch jobs is that they introduce a delay between the time the data is collected and
the time its effects are visible. The length of the delay depends on the frequency with which the job is run, but
it is often on the order of hours or days.
Even though MapReduce is a lowest-common-denominator programming model, and has fairly poor performance
compared to specialized massively parallel database engines [2], it has been a remarkably successful tool
for implementing recommendation systems [30]. Systems such as Spark [34] overcome some of the performance
problems of MapReduce, although they remain batch-oriented.

1.2 Batch Workflows
A recommendation and personalization system can be built as a workflow, a directed graph of MapReduce
jobs [30]. Each job reads one or more input datasets (typically directories on the Hadoop Distributed Filesystem,
HDFS), and produces one or more output datasets (in other directories). A job treats its input as immutable
and completely replaces its output. Jobs are chained by directory name: the same name is configured as the output
directory for the first job and the input directory for the second job.
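The chaining described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes a local temporary directory stands in for HDFS, and the names `run_job`, `count_views`, `top_item`, the directory names, and the `part-*` file naming are hypothetical choices made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def run_job(transform, input_dirs, output_dir):
    """Run one batch job: read all input directories, replace output_dir."""
    output_dir = Path(output_dir)
    # Inputs are treated as immutable: they are only ever read.
    lines = []
    for directory in input_dirs:
        for part in sorted(Path(directory).glob("part-*")):
            lines.extend(part.read_text().splitlines())

    # Write the new output to a scratch directory, then swap it in, so the
    # previous output is completely replaced rather than appended to.
    scratch = Path(tempfile.mkdtemp(dir=output_dir.parent))
    text = "".join(f"{line}\n" for line in transform(lines))
    (scratch / "part-00000").write_text(text)
    if output_dir.exists():
        shutil.rmtree(output_dir)
    scratch.rename(output_dir)


def count_views(lines):
    """Job 1 (hypothetical): count views per item from 'user,item' lines."""
    counts = {}
    for line in lines:
        _user, item = line.split(",")
        counts[item] = counts.get(item, 0) + 1
    return [f"{item},{n}" for item, n in sorted(counts.items())]


def top_item(lines):
    """Job 2 (hypothetical): pick the most viewed item from 'item,count' lines."""
    best = max(lines, key=lambda line: int(line.split(",")[1]))
    return [best.split(",")[0]]


# A tiny input dataset in a temporary directory (stand-in for HDFS).
root = Path(tempfile.mkdtemp())
raw = root / "raw_views"
raw.mkdir()
(raw / "part-00000").write_text("alice,book\nbob,book\nalice,lamp\n")

# The two jobs are chained only by this shared directory name:
counts_dir = root / "view_counts"
run_job(count_views, [raw], counts_dir)             # job 1 writes it
run_job(top_item, [counts_dir], root / "top_item")  # job 2 reads it
result = (root / "top_item" / "part-00000").read_text()
```

Neither job function refers to the other; the only thing they share is the `view_counts` directory name, which is what makes the chain easy to rearrange or extend with additional jobs.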
This method of chaining jobs by directory name is simple but expensive in terms of I/O. In return, it provides
several important benefits:
Multi-consumer. Several different jobs can read the same input directory without affecting each other. Adding
a slow or unreliable consumer affects neither the producer of the dataset, nor other consumers.
Visibility. Every job’s input and output can be inspected by ad-hoc debugging jobs for tracking down the cause
of an error. Inspection of inputs and outputs is also valuable for audit and capacity planning purposes, and
monitoring whether jobs are providing the required level of service.
Team interface. A job operated by one team of people can produce a dataset, and jobs operated by other teams
can consume the dataset. The directory name thus acts as an interface between the teams, and it can be
reinforced with a contract (e.g. prescribing the data format, schema, field semantics, partitioning scheme,
and frequency of updates). This arrangement helps organizational scalability.
Loose coupling. Different jobs can be written in different programming languages, using different libraries, but
they can still communicate as long as they can read and write the same file format for inputs and outputs.
A job does not need to know which jobs produce its inputs and consume its outputs. Different jobs can be
run on different schedules, at different priorities, by different users.
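As a rough illustration of the multi-consumer and team-interface properties, the sketch below has one producer write a dataset directory and several independent consumers read it. Everything specific here is an assumption made for the example: the `CONTRACT` dictionary, the CSV format, the field names, and the consumer functions are hypothetical, and a local temporary directory stands in for HDFS.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical contract for one dataset directory (format and field names
# are illustrative assumptions, not taken from the paper).
CONTRACT = {"format": "csv", "fields": ["user_id", "item_id", "rating"]}


def read_dataset(directory):
    """Read every part file, checking its header against the contract."""
    rows = []
    for part in sorted(Path(directory).glob("part-*")):
        with part.open(newline="") as f:
            reader = csv.DictReader(f)
            if reader.fieldnames != CONTRACT["fields"]:
                raise ValueError(f"{part.name} violates the contract")
            rows.extend(reader)
    return rows


def average_rating(directory):
    """Consumer A: mean rating over the whole dataset."""
    rows = read_dataset(directory)
    return sum(float(r["rating"]) for r in rows) / len(rows)


def ratings_per_user(directory):
    """Consumer B: number of ratings submitted by each user."""
    counts = {}
    for r in read_dataset(directory):
        counts[r["user_id"]] = counts.get(r["user_id"], 0) + 1
    return counts


def broken_consumer(directory):
    """Consumer C: reads the dataset, then fails."""
    read_dataset(directory)
    raise RuntimeError("consumer crashed")


# Producer: writes the dataset once, following the contract.
dataset = Path(tempfile.mkdtemp()) / "ratings"
dataset.mkdir()
with (dataset / "part-00000").open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=CONTRACT["fields"])
    writer.writeheader()
    writer.writerow({"user_id": "alice", "item_id": "book", "rating": 4})
    writer.writerow({"user_id": "bob", "item_id": "book", "rating": 2})
    writer.writerow({"user_id": "alice", "item_id": "lamp", "rating": 3})

before = (dataset / "part-00000").read_bytes()
try:
    broken_consumer(dataset)
except RuntimeError:
    pass  # the failure stays inside the failing consumer
avg = average_rating(dataset)
per_user = ratings_per_user(dataset)
after = (dataset / "part-00000").read_bytes()
```

Because consumers only read the directory, the crash in `broken_consumer` leaves the dataset and the other two consumers untouched, and any consumer could be rewritten in another language as long as it honors the same file format.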
