Yahoo S4 stream computing platform
114106000699 陈娜
S4 (Simple Scalable Streaming System) is a platform originally developed by Yahoo to improve the click-through rate of search advertisements. By analyzing users' clicks on ads and filtering out ads with low relevance, S4 raises the click-through rate. It can therefore be regarded as a distributed stream computing model.
S4 is designed for streaming data and real-time processing, so for businesses that need real-time analysis it can process data efficiently. Once the system is online, it rarely requires human intervention: a steady stream of data is analyzed and automatically routed, and even very large volumes can be processed quickly. The disadvantage is that, at present, S4's data transmission is not fully reliable, so data may be lost. Because data is kept in memory, all of the data on a node is lost when that node breaks down. Moreover, S4's suitability depends on the scenario. Real-time analysis usually targets large numbers of small, discrete records, and from a statistical point of view losing part of the data has no significant impact on the final results, while tolerating such loss improves throughput significantly. So for now, S4 is best suited to scenarios that do not require careful analysis of every record, but only the final aggregate results used to adjust and plan the business.
When the system is running and a node fails or exits for other reasons, S4 keeps sending events to that disabled node, so a large number of events are lost. The S4 framework derives the identifier of the destination node by hashing the event key over the number of nodes; when a node exits, there is no mechanism to update the node count accordingly, so keys that normally hash to the departed node continue to do so, and many new events are still sent to a node that no longer exists (a routing sketch is given below).

Based on this shortcoming, I propose a dynamic node-removal requirement: while the framework is running and the business is not interrupted, if a node fails or exits for other reasons, the remaining nodes should detect the departure within a short period of time and take over the departed node's work as soon as possible. This avoids sending a large number of new events to the departed node, and therefore losing them, and ensures that the framework returns to load balance shortly after the node is removed.

A node may leave because it fails or because a system administrator replaces old hardware. For S4, one way to reduce the error rate is to duplicate each node so that the two copies hold completely consistent contents. When one node breaks down, the system can stop its work and switch to the other copy. In a small system this does not increase cost too much, but stopping and restarting nodes reduces real-time performance. It is a research direction that can be considered under certain conditions.
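To make the routing problem above concrete, here is a minimal sketch assuming a simple hash-modulo scheme like the one described; the class and method names are hypothetical and this is not the actual S4 implementation.

    import java.util.List;

    // Hypothetical sketch of key-based routing (not the real S4 code):
    // the destination node is derived from the event key and the node count.
    public class KeyRouter {
        private final List<String> liveNodes;   // current cluster membership

        public KeyRouter(List<String> liveNodes) {
            this.liveNodes = liveNodes;
        }

        // Pick the destination node for an event key.
        public String destinationFor(String eventKey) {
            int slot = Math.abs(eventKey.hashCode()) % liveNodes.size();
            return liveNodes.get(slot);
        }

        // If a node exits but removeNode() is never called (no update
        // mechanism), destinationFor() keeps mapping some keys to the dead
        // node, and those events are effectively lost.
        public void removeNode(String nodeId) {
            liveNodes.remove(nodeId);
        }
    }

The dynamic node-removal requirement amounts to calling something like removeNode() on every surviving node soon after a departure is detected, so that new events are re-hashed onto live nodes.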
S4 expects its input as a stream of events, which raises the question of how the events are generated. Before the data stream enters S4, there must be an intermediate (adapter) system that transforms the raw data stream into events. From the point of view of cluster scalability, a larger data stream can be handled by adding nodes, but at present nodes cannot be added or removed dynamically: adjusting the nodes may require stopping the current work, so there is no seamless adjustment. In addition, because S4 still cannot guarantee 100% reliable data transmission, data errors grow rapidly as the cluster grows, and it is worth exploring how large an S4 cluster can actually be. If the reliability of data transmission improves, S4 will give better results.
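As an illustration of such an intermediate layer, the following is a minimal sketch of an adapter that turns raw click-log lines into (key, attribute) events before they enter S4; the Event shape, field names, and input format are assumptions made for this example, not S4's real API.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical adapter: turns raw text records into (key, attributes) events.
    public class ClickLogAdapter {

        // A minimal event shape in the (K, A) form used in this report.
        public static class Event {
            public final String key;
            public final Map<String, String> attributes;
            public Event(String key, Map<String, String> attributes) {
                this.key = key;
                this.attributes = attributes;
            }
        }

        // Example input line: "user42,ad1001,clicked"
        public Event toEvent(String rawLine) {
            String[] parts = rawLine.split(",");
            Map<String, String> attrs = new HashMap<>();
            attrs.put("adId", parts[1]);
            attrs.put("action", parts[2]);
            // Key the event by user so all of one user's clicks reach the same PE.
            return new Event("user=" + parts[0], attrs);
        }
    }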
To protect the reliability of data transmission, S4 supports both the UDP and TCP protocols. As for coupling, S4 completely separates the platform from the business logic: the application only needs to write PE logic, so the coupling between business and platform is very low. The design of S4 combines MapReduce with the Actor model. Because of its symmetric structure, the design achieves a very high degree of simplicity: all nodes in the cluster are equal and there is no central control. In other words, it relies on a simple cluster management service that can be shared with systems in multiple data centers.
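The low coupling can be pictured with a hypothetical Processing Element skeleton: the application writer only fills in the per-event logic, while routing and transport remain inside the platform. The class and method names below are assumptions made for illustration, not the actual S4 API.

    // Hypothetical sketch of the separation between platform and business logic.
    public abstract class SketchPE {

        // Written by the application: react to one matching event.
        protected abstract void onEvent(String key, String value);

        // Provided by the platform: forward a result as a new keyed event.
        protected final void emit(String key, String value) {
            // In this sketch we just print; the platform would route it by key
            // over UDP or TCP.
            System.out.println("emit " + key + " -> " + value);
        }
    }

    // The business side only fills in onEvent(); transport, routing, and
    // cluster management stay inside the platform.
    class AdClickCounterPE extends SketchPE {
        private long clicks = 0;

        @Override
        protected void onEvent(String key, String value) {
            clicks++;
            emit(key, Long.toString(clicks));
        }
    }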
S4 abstracts a stream as a sequence of elements of the form (K, A), where K is a key and A is an attribute set. On top of this abstraction, S4 defines the component that consumes and emits (K, A) elements: the Processing Element (PE). The PE is the smallest data-processing unit in S4. Each PE instance consumes only the events whose event type, key attribute, and value attribute match it, and finally outputs results or new (K, A) elements. S4 divides stream processing into multiple stream events and abstracts them as directed edges of a processing graph, each represented in (K, A) form. This representation makes the transformation of events very convenient and is a design inherited from MapReduce's (key, value) pairs. At the same time, because the stream is divided into many events, the S4 system needs a corresponding set of processing elements; each PE handles only one event key and every PE is independent, which greatly reduces the conceptual and system complexity.
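A small worked example of the (K, A) abstraction and of per-key PE instances follows; the names and values are hypothetical and only meant to illustrate the idea.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the (K, A) abstraction with one PE instance per distinct key.
    public class KADispatchSketch {

        // Minimal PE: counts the events it has consumed for its key.
        static class CounterPE {
            long seen = 0;
            void onEvent(Map<String, String> attributes) { seen++; }
        }

        // One PE instance per distinct key, created on first use, so each PE
        // handles exactly one key and stays independent of the others.
        private final Map<String, CounterPE> instances = new HashMap<>();

        public void dispatch(String key, Map<String, String> attributes) {
            instances.computeIfAbsent(key, k -> new CounterPE()).onEvent(attributes);
        }

        public static void main(String[] args) {
            KADispatchSketch sketch = new KADispatchSketch();
            // K = "adId=1001", A = {userId=42, action=click}
            sketch.dispatch("adId=1001", Map.of("userId", "42", "action", "click"));
            sketch.dispatch("adId=1001", Map.of("userId", "7",  "action", "click"));
            sketch.dispatch("adId=2002", Map.of("userId", "42", "action", "click"));
            System.out.println(sketch.instances.get("adId=1001").seen); // prints 2
        }
    }

Running main prints 2, because the two click events sharing the key "adId=1001" are consumed by the same PE instance, while the event keyed "adId=2002" goes to a different instance.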