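Stream Processing Architecture Design Based on Kafka and Samza: Building Highly Maintainable Real-Time Data Pipelines Following the Unix Philosophy (PDF Download)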


Date: 2025-09-13 09:23  Source: http://www.java1234.com  Author: Reposted
Main content:
 

1.1 Implementing Large-Scale Personalized Services
In a large-scale service with many features, the maintainability and the operational robustness of an implementation
are of paramount importance. The system should have the following properties:
System scalability: Supporting an online service with hundreds of millions of registered users, handling millions
of requests per second.
Organizational scalability: Allowing hundreds or even thousands of software engineers to work on the system
without excessive coordination overhead.
Operational robustness: If one part of the system is slow or unavailable, the rest of the system should continue
working normally as much as possible.
Large-scale personalized services have been successfully implemented as batch jobs [30], for example using
MapReduce [6]. Performing a recommendation system’s computations in offline batch jobs decouples them from
the online systems that serve user requests, making them easier to maintain and less operationally sensitive.
The main downside of batch jobs is that they introduce a delay between the time the data is collected and
the time its effects are visible. The length of the delay depends on the frequency with which the job is run, but
it is often on the order of hours or days.
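The dependence of the delay on the run frequency can be made concrete with a small calculation. The sketch below is illustrative only and rests on assumptions that are not in the text: the job starts on a fixed schedule, it reads everything collected before it starts, its results become visible only when it finishes, and its runtime is constant. The function name `staleness_bounds` is hypothetical.

```python
from datetime import timedelta


def staleness_bounds(interval: timedelta, runtime: timedelta) -> tuple:
    """Return (best, worst) delay between collecting a data item and seeing its effect.

    Assumes a fixed schedule, a constant runtime, and results that become
    visible only when a run finishes.
    """
    # Best case: the item arrives just before a run starts, so it only
    # waits for that run to finish.
    best = runtime
    # Worst case: the item arrives just after a run starts, so it waits a
    # full interval for the next run and then for that run to finish.
    worst = interval + runtime
    return best, worst


if __name__ == "__main__":
    best, worst = staleness_bounds(timedelta(days=1), timedelta(hours=2))
    print(best, worst)
```

Under these assumptions, a daily job that takes two hours shows the effect of new data between 2 and 26 hours after it was collected, which matches the "hours or days" range mentioned above.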
Even though MapReduce is a lowest-common-denominator programming model and performs fairly poorly
compared to specialized massively parallel database engines [2], it has been a remarkably successful tool
for implementing recommendation systems [30]. Systems such as Spark [34] overcome some of the performance
problems of MapReduce, although they remain batch-oriented.

 

1.2 Batch Workflows
A recommendation and personalization system can be built as a workflow, a directed graph of MapReduce
jobs [30]. Each job reads one or more input datasets (typically directories on the Hadoop Distributed Filesystem,
HDFS), and produces one or more output datasets (in other directories). A job treats its input as immutable
and completely replaces its output. Jobs are chained by directory name: the same name is configured as the output
directory for the first job and as the input directory for the second job.
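Chaining by directory name can be sketched in a few lines. The example below is a minimal illustration, not a MapReduce implementation: it uses the local filesystem in place of HDFS and plain Python functions in place of jobs. All names (`count_clicks_job`, `top_users_job`, the `part-00000.jsonl` file name, the directory names) are hypothetical.

```python
import json
import shutil
import tempfile
from collections import Counter
from pathlib import Path


def replace_output(output_dir: Path, records: list) -> None:
    """Write records to a staging directory, then swap it in for output_dir."""
    staging = output_dir.with_name(output_dir.name + ".tmp")
    shutil.rmtree(staging, ignore_errors=True)
    staging.mkdir(parents=True)
    with open(staging / "part-00000.jsonl", "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    shutil.rmtree(output_dir, ignore_errors=True)  # previous output is discarded whole
    staging.rename(output_dir)


def read_input(input_dir: Path) -> list:
    """Read every record in a dataset directory without modifying it."""
    records = []
    for part in sorted(input_dir.glob("part-*.jsonl")):
        with open(part) as f:
            records.extend(json.loads(line) for line in f if line.strip())
    return records


def count_clicks_job(input_dir: Path, output_dir: Path) -> None:
    """Job 1: count click events per user."""
    counts = Counter(event["user"] for event in read_input(input_dir))
    replace_output(output_dir, [{"user": u, "clicks": n} for u, n in sorted(counts.items())])


def top_users_job(input_dir: Path, output_dir: Path, limit: int) -> None:
    """Job 2: keep the users with the most clicks."""
    ranked = sorted(read_input(input_dir), key=lambda r: (-r["clicks"], r["user"]))
    replace_output(output_dir, ranked[:limit])


def run_workflow(root: Path) -> list:
    events_dir = root / "events"
    counts_dir = root / "click_counts"  # output of job 1 and input of job 2
    top_dir = root / "top_users"
    users = ["ann", "bob", "ann", "cy", "ann", "bob"]
    replace_output(events_dir, [{"user": u} for u in users])
    count_clicks_job(events_dir, counts_dir)
    top_users_job(counts_dir, top_dir, limit=2)
    return read_input(top_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_workflow(Path(tmp)))
```

The only link between the two jobs is the `click_counts` directory name. Neither function calls the other, and re-running a job replaces its whole output directory rather than appending to it.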
This method of chaining jobs by directory name is simple and, although it is expensive in terms of I/O, it provides
several important benefits:
Multi-consumer. Several different jobs can read the same input directory without affecting each other. Adding
a slow or unreliable consumer affects neither the producer of the dataset, nor other consumers.
Visibility. Every job’s input and output can be inspected by ad-hoc debugging jobs for tracking down the cause
of an error. Inspection of inputs and outputs is also valuable for audit and capacity planning purposes, and
for monitoring whether jobs are providing the required level of service.
Team interface. A job operated by one team of people can produce a dataset, and jobs operated by other teams
can consume the dataset. The directory name thus acts as an interface between the teams, and it can be
reinforced with a contract (e.g. prescribing the data format, schema, field semantics, partitioning scheme,
and frequency of updates). This arrangement helps organizational scalability.
Loose coupling. Different jobs can be written in different programming languages, using different libraries, but
they can still communicate as long as they can read and write the same file format for inputs and outputs.
A job does not need to know which jobs produce its inputs and consume its outputs. Different jobs can be
run on different schedules, at different priorities, by different users.
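The multi-consumer and team-interface properties listed above can be illustrated with a short sketch. It assumes a dataset stored as JSON lines in a local directory standing in for HDFS. The contract shown here (a mapping from field name to required type) is a hypothetical, minimal stand-in for the richer contract the text describes, and all function and directory names are invented for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical contract for the dataset published under "click_counts":
# field name -> required Python type.
CLICK_COUNTS_CONTRACT = {"user": str, "clicks": int}


def write_dataset(directory: Path, records: list) -> None:
    """Producer side: publish records under the agreed directory name."""
    directory.mkdir(parents=True, exist_ok=True)
    with open(directory / "part-00000.jsonl", "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_dataset(directory: Path, contract: dict) -> list:
    """Consumer side: read a dataset and reject records that violate the contract."""
    records = []
    for part in sorted(directory.glob("part-*.jsonl")):
        with open(part) as f:
            for line in f:
                if not line.strip():
                    continue
                record = json.loads(line)
                for field, expected_type in contract.items():
                    if not isinstance(record.get(field), expected_type):
                        raise ValueError(f"{part.name}: bad or missing field {field!r}")
                records.append(record)
    return records


def total_clicks_consumer(directory: Path) -> int:
    """One consumer: total number of clicks in the dataset."""
    return sum(r["clicks"] for r in read_dataset(directory, CLICK_COUNTS_CONTRACT))


def active_users_consumer(directory: Path) -> list:
    """An independent consumer: users with at least two clicks."""
    rows = read_dataset(directory, CLICK_COUNTS_CONTRACT)
    return [r["user"] for r in rows if r["clicks"] >= 2]


def run_consumers(root: Path) -> tuple:
    dataset = root / "click_counts"
    write_dataset(dataset, [
        {"user": "ann", "clicks": 3},
        {"user": "bob", "clicks": 2},
        {"user": "cy", "clicks": 1},
    ])
    return total_clicks_consumer(dataset), active_users_consumer(dataset)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_consumers(Path(tmp)))
```

Both consumers only read the directory, so either one can be slow, fail, or be added later without the producer or the other consumer noticing. The consumers depend on the directory name and the contract, not on the code that produced the data.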



 

 