[1] Pål Evensen and Hein Meling. Sensewrap: A service oriented middleware with sensor virtualization and self-configuration. In 5th Int'l Conf. on Intelligent Sensors, Sensor Networks and Information Processing, December 2009. [ bib | DOI | http | .pdf ]
[2] Máté J. Csorba, Hein Meling, and Poul E. Heegaard. Laying pheromone trails for balanced and dependable component mappings. In 4th Int'l Workshop on Self-Organizing Systems, volume 5918 of Lecture Notes in Computer Science, pages 50-64, Zurich, Switzerland, December 2009. IFIP TC 6, Springer-Verlag. [ bib | DOI | http | .pdf ]
[3] Hein Meling and Alberto Montresor. Type-safe dynamic protocol composition in Jgroup/ARM. In MAI '09: Proceedings of the 3rd International DisCoTec Workshop on Middleware-Application Interaction, Electronic Communications of the EASST, pages 1-6, Lisbon, Portugal, June 2009. European Association of Software Science and Technology. [ bib | DOI | http | .pdf ]
[4] Hein Meling, Alberto Montresor, Bjarne E. Helvik, and Ozalp Babaoglu. Jgroup/ARM: a distributed object group platform with autonomous replication management. Software: Practice and Experience, 38(9):885-923, July 2008. [ bib | DOI | .pdf ]
This paper presents the design and implementation of Jgroup/ARM, a distributed object group platform with autonomous replication management along with a novel measurement-based assessment technique that is used to validate the fault-handling capability of Jgroup/ARM. Jgroup extends Java RMI through the group communication paradigm and has been designed specifically for application support in partitionable systems. ARM aims at improving the dependability characteristics of systems through a fault-treatment mechanism. Hence, ARM focuses on deployment and operational aspects, where the gain in terms of improved dependability is likely to be the greatest. The main objective of ARM is to localize failures and to reconfigure the system according to application-specific dependability requirements. Combining Jgroup and ARM can significantly reduce the effort necessary for developing, deploying and managing dependable, partition-aware applications. Jgroup/ARM is evaluated experimentally to validate its fault-handling capability; the recovery performance of a system deployed in a wide area network is evaluated. In this experiment multiple nearly coincident reachability changes are injected to emulate network partitions separating the service replicas. The results show that Jgroup/ARM is able to recover applications to their initial state in several realistic failure scenarios, including multiple, concurrent network partitionings.

Keywords: fault tolerance, fault treatment, replication and recovery management, measurement-based assessment, middleware, remote method invocation, group communication
[5] Hein Meling and Joakim L. Gilje. A Distributed Approach to Autonomous Fault Treatment in Spread. In Proceedings of the 7th European Dependable Computing Conference (EDCC). IEEE Computer Society, May 2008. [ bib | http | .pdf ]
This paper presents the design and implementation of the Distributed Autonomous Replication Management (DARM) framework built on top of the Spread group communication system. The objective of DARM is to improve the dependability characteristics of systems through a fault treatment mechanism. Unlike many existing fault tolerance frameworks, DARM focuses on deployment and operational aspects, where the gain in terms of improved dependability is likely to be the greatest.

DARM is novel in that recovery decisions are distributed to each individual group deployed in the system, eliminating the need for a centralized manager with global information about all groups. This scheme allows groups to perform fault treatment on themselves. A group leader in each group is responsible for fault treatment by means of replacing failed group members; the approach also tolerates failure of the group leader. The advantages of the distributed approach are: (i) no need to maintain globally centralized information about all groups, which is costly and limits scalability, (ii) reduced infrastructure complexity, and (iii) less communication overhead. We evaluate the approach experimentally to validate its fault handling capability; the recovery performance of a system deployed in a local area network is evaluated. The results show that applications can recover to their initial system configuration in a very short period of time.
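
The leader-driven repair described above can be sketched roughly as follows. This is an illustrative sketch only, not DARM's implementation or the Spread API: the `Group` class, the rule that the oldest surviving member acts as leader, the `target_size` policy, and the `spawn` stand-in for starting a replica are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A replica group that repairs itself without a global manager (hypothetical)."""
    name: str
    target_size: int                              # assumed redundancy policy
    members: list = field(default_factory=list)   # ordered view, oldest first
    next_id: int = 0

    def leader(self):
        # Assumption: the oldest surviving member of the view acts as leader,
        # so a leader failure simply promotes the next member.
        return self.members[0] if self.members else None

    def spawn(self):
        # Stand-in for starting a replacement replica on some host.
        replica = f"{self.name}-r{self.next_id}"
        self.next_id += 1
        self.members.append(replica)
        return replica

    def on_failure(self, failed):
        """Drop the failed member, then let the current leader restore redundancy."""
        self.members.remove(failed)
        if self.leader() is None:
            return []  # no survivor is left to perform fault treatment
        missing = self.target_size - len(self.members)
        return [self.spawn() for _ in range(missing)]
```

In this sketch the only state consulted is the group's own view, which mirrors the point that no globally centralized information about other groups is needed.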

[6] Hein Meling. Adaptive Middleware Support and Autonomous Fault Treatment: Architectural Design, Prototyping and Experimental Evaluation. PhD thesis, Norwegian University of Science and Technology, Department of Telematics, May 2006. [ bib | http | .pdf ]
Networked computer systems are prevalent in most aspects of modern society, and we have become dependent on such computer systems to perform many critical tasks. Moreover, making such systems dependable is an important goal. However, dependability issues are often neglected when developing systems due to the complexities of the techniques involved.

A common technique used to improve the dependability characteristics of systems is to replicate critical system components whereby the functions they perform are repeated by multiple replicas. Replicas are often distributed geographically and connected through a network as a means to render the failure of one replica independent of the others. However, the network is also a potential source of failures, as nodes can become temporarily disconnected from each other, introducing an array of new problems.

The majority of previous projects have focused on the provision of middleware libraries aimed at simplifying the development of dependable distributed systems, whereas the pivotal deployment and operational aspects of such systems have received very little attention. This thesis extends previous work and emphasizes the deployment and operational aspects, where the gain in terms of improved dependability is likely to be the greatest.

The main contribution of this dissertation is an architecture for autonomous replication management, aimed at improving the dependability characteristics of systems through a self-managed fault treatment mechanism that is adaptive to network dynamics and changing requirements. Consequently, the architecture also improves the deployment and operational aspects of systems, and reduces the human interaction needed. The architecture has been implemented as a proof-of-concept prototype by extending the Jgroup object group system.

In addition, numerous supporting contributions are included in this work: (i) an architecture for dynamic protocol composition that avoids the delays of event processing in intermediate layers of a strictly vertical protocol stack; (ii) adaptive protocol selection on a per method/invocation basis, made possible by annotating server methods with the replication protocol to be used; (iii) client-side membership handling, aimed at improving the load balancing and failover properties of systems when exposed to failures; (iv) online upgrade management of operational services, implemented as an extension to the replication management architecture.

Finally, the dissertation provides extensive experimental evaluation of the fault treatment capabilities of the autonomous replication management architecture, with emphasis on testing complex failure scenarios. The first experiment examines the ability of clients to maintain correct membership when servers crash and recover. The second experiment investigates the behavior of services when exposed to multiple nearly-coincident node crash failures. In conjunction with this experiment, a novel technique has been developed to estimate various service dependability characteristics. In the third experiment the recovery performance of a system deployed in a wide area network is evaluated. In this experiment multiple nearly-coincident reachability changes are injected to simulate network partitions separating the service replicas.

To support the experimental evaluation, a set of generic tools has also been developed to aid the execution and analysis of the experiments.

[7] Bjarne E. Helvik, Hein Meling, and Alberto Montresor. An Approach to Experimentally Obtain Service Dependability Characteristics of the Jgroup/ARM System. In Proceedings of the Fifth European Dependable Computing Conference (EDCC), volume 3463 of Lecture Notes in Computer Science, pages 179-198. Springer-Verlag, April 2005. [ bib | .pdf ]
Jgroup/ARM is a middleware framework for operating dependable distributed applications based on Java. Jgroup integrates the distributed object models of Java RMI and Jini with the object group communication paradigm, enabling the construction of groups of replicated server objects that provide dependable services to clients. ARM provides automated mechanisms for distributing replicas to host processors and recovering from replica failures. This paper describes an approach based on stratified sampling combined with fault injections for estimating the dependability attributes of a service deployed using the Jgroup/ARM middleware framework. A first experimental evaluation is performed focusing on a service provided by a triplicated server, and indicative predictions of various dependability attributes of the service are obtained. The evaluation shows that a very high availability and MTBF may be achieved for services based on Jgroup/ARM.
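
The general idea of a stratified estimate can be illustrated with a short sketch. It shows only the textbook estimator (each stratum's sample mean weighted by the probability of that stratum); the choice of strata, the weights, and the measured quantity below are hypothetical and are not taken from the paper.

```python
def stratified_estimate(strata):
    """Combine per-stratum sample means into one weighted estimate.

    strata: list of (weight, samples) pairs, where weight is the assumed
    probability of the stratum and samples are measurements obtained by
    fault injection within that stratum.
    """
    total_weight = sum(weight for weight, _ in strata)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("stratum weights must sum to 1")
    return sum(weight * (sum(samples) / len(samples))
               for weight, samples in strata)


# Hypothetical example: downtime in seconds per failure trajectory, with
# single-failure trajectories assumed far more likely than double failures.
strata = [
    (0.9, [0.0, 0.0]),     # one injected failure: service stays up
    (0.1, [10.0, 30.0]),   # two nearly coincident failures: some downtime
]
expected_downtime = stratified_estimate(strata)
```

Stratifying lets rare but costly scenarios be sampled deliberately rather than waiting for them to occur at their natural rate.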

[8] Hein Meling and Bjarne E. Helvik. Performance Consequences of Inconsistent Client-side Membership Information in the Open Group Model. In Proceedings of the 23rd International Performance, Computing, and Communications Conference (IPCCC), Phoenix, Arizona, April 2004. [ bib | .pdf ]
In a distributed fault-tolerant server system realized according to the open group model, inconsistency will (temporarily) arise between the dynamic membership of the replicated service and its client-side representation in the event of server failures and recoveries. The paper proposes techniques for maintaining this consistency and discusses their performance implications in failure/recovery scenarios where clients load balance requests on the servers. Comparative performance measurements are carried out for two of the proposed techniques. The results indicate that the performance impact of lacking consistency is easily kept small, and that the cost of the technique is small.
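
To make the problem concrete, the sketch below shows a client that load balances over a cached membership view and repairs that view when it turns out to be stale. It is one plausible scheme written for illustration and is not necessarily either of the techniques measured in the paper; `ClientProxy`, the `lookup` callable, and the use of `ConnectionError` to signal a crashed server are all assumptions.

```python
class ClientProxy:
    """Round-robin client over a cached, possibly stale, server view (hypothetical)."""

    def __init__(self, lookup):
        self.lookup = lookup            # callable returning the current server list
        self.view = list(lookup())      # client-side copy of the membership
        self.next = 0                   # round-robin position

    def refresh(self):
        # Re-read the membership, e.g. from a registry or lookup service.
        self.view = list(self.lookup())
        self.next = 0

    def invoke(self, call):
        """Invoke call(server) on the next server, skipping stale entries."""
        refreshed = False
        while True:
            if not self.view:
                if refreshed:
                    raise ConnectionError("no reachable server")
                self.refresh()
                refreshed = True
                continue
            server = self.view[self.next % len(self.view)]
            try:
                result = call(server)
            except ConnectionError:
                self.view.remove(server)   # stale entry: server has failed
                continue
            self.next += 1
            return result
```

A view repaired this way only shrinks on failures; recovered servers become visible again only after a refresh, which is the kind of temporary inconsistency the abstract refers to.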

[9] Alberto Montresor, Hein Meling, and Ozalp Babaoglu. Toward Self-Organizing, Self-Repairing and Resilient Distributed Systems, chapter 22, pages 119-124. Number 2584 in Lecture Notes in Computer Science. Springer-Verlag, Bologna, Italy, June 2003. [ bib ]
[10] Ozalp Babaoglu, Hein Meling, and Alberto Montresor. Anthill: A Framework for the Development of Agent-Based Peer-to-Peer Systems. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, July 2002. [ bib | .pdf ]
Recent peer-to-peer (P2P) systems are characterized by decentralized control, large scale and extreme dynamism of their operating environment. As such, they can be seen as instances of complex adaptive systems (CAS) typically found in biological and social sciences. In this paper we describe Anthill, a framework to support the design, implementation and evaluation of P2P applications based on ideas such as multi-agent and evolutionary programming borrowed from CAS. An Anthill system consists of a dynamic network of peer nodes; societies of adaptive agents travel through this network, interacting with nodes and cooperating with other agents in order to solve complex problems. Anthill can be used to construct different classes of P2P services that exhibit resilience, adaptation and self-organization properties. We also describe preliminary experiences with Anthill in implementing a file sharing application.