[1] |
Dan R. K. Ports.
The future of cloud networking is systems.
In Proceedings of the 15th Asia-Pacific Workshop on Systems
(APSYS '24), Kyoto, Japan, September 2024. ACM.
Keynote address.
[ bib |
slides (.pdf) ]
As cloud platforms evolve, the boundaries between networking and systems are increasingly blurred. Cloud networking not only asks us to solve a challenging systems problem -- providing an efficient, reliable, and secure virtual infrastructure -- but offers us opportunities to rethink our approach to classic challenges in distributed systems. This keynote will explore the potential of systems/networking co-design through the lens of a cloud-scale hardware-accelerated load balancing platform. Specifically, I will discuss how programmable networking technology, including programmable switches and smart NICs, enables these load balancers to achieve dramatically higher efficiency than existing software solutions while simultaneously offering increased flexibility for custom, application-specific load balancing logic. Looking to the future, I’ll describe three opportunities that this design offers for redefining the architecture of distributed systems with new load-balancing, migration, and snapshotting algorithms, paving the way to a new generation of high-performance, resilient, and scalable cloud services. |
[2] |
Liangcheng Yu, Xiao Zhang, Haoran Zhang, John Sonchak, Dan R. K. Ports, and
Vincent Liu.
Beaver: Practical partial snapshots for distributed cloud services.
In Proceedings of the 18th USENIX Symposium on Operating
Systems Design and Implementation (OSDI '24), Santa Clara, CA, USA,
July 2024. USENIX.
[ bib |
.pdf ]
Distributed snapshots are a classic class of protocols used for capturing a causally consistent view of states across machines. Although effective, existing protocols presume an isolated universe of processes to snapshot and require instrumentation and coordination of all of them. This assumption does not match today's cloud services -- it is not always practical to instrument all involved processes, nor realistic to assume zero interaction between the machines of interest and the external world. |
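For contrast, here is a minimal sketch of the classic marker-based snapshot protocol (in the Chandy-Lamport style) that such systems build on; it assumes FIFO channels and that every process and channel is instrumented -- exactly the assumption Beaver relaxes. All class and method names are illustrative.

```python
# Marker-based snapshot sketch: every process records its state on the first
# marker it sees and records in-flight messages per channel until that
# channel's marker arrives.

MARKER = "MARKER"

class Channel:
    def __init__(self):
        self.queue = []
    def send(self, msg):
        self.queue.append(msg)

class Process:
    def __init__(self, pid, in_chans, out_chans):
        self.pid = pid
        self.in_chans, self.out_chans = in_chans, out_chans
        self.state = {}          # application state
        self.local_snap = None   # recorded local state
        self.chan_snap = {}      # in-channel -> messages caught in flight
        self.open_chans = set()  # in-channels still being recorded

    def _record_and_forward(self):
        self.local_snap = dict(self.state)
        self.chan_snap = {c: [] for c in self.in_chans}
        self.open_chans = set(self.in_chans)
        for c in self.out_chans:
            c.send(MARKER)

    def initiate_snapshot(self):           # called on one process
        self._record_and_forward()

    def receive(self, chan, msg):
        if msg == MARKER:
            if self.local_snap is None:    # first marker seen
                self._record_and_forward()
            self.open_chans.discard(chan)  # marker closes this channel's window
        else:
            if self.local_snap is not None and chan in self.open_chans:
                self.chan_snap[chan].append(msg)  # in flight at snapshot time
            self.handle(msg)

    def handle(self, msg):
        pass  # normal application message processing goes here
```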
[3] |
Inho Choi, Nimesh Wadekar, Raj Joshi, Dan R. K. Ports, Irene Zhang, and Jialin
Li.
Capybara: Microsecond-scale live TCP migration.
In Proceedings of the 14th Asia-Pacific Workshop on Systems
(APSYS '23), Seoul, South Korea, August 2023. ACM.
[ bib |
.pdf ]
Latency-critical μs-scale data center applications are susceptible to server load spikes. The issue is particularly challenging for services using long-lived TCP connections. This paper introduces Capybara, a highly efficient and versatile live TCP migration system. Capybara builds atop a deterministic, kernel-bypassed TCP stack running in a library OS to realize its μs-scale TCP migration mechanism. Using modern programmable switches, Capybara implements migration-aware dynamic packet forwarding and transient packet buffering, further reducing system interference during live TCP migration. Capybara can transparently migrate a running TCP connection in 4 μs on average. It improves the average migration host latency by about 12 times compared to a Linux kernel-based solution. |
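The paper's exact handoff format isn't reproduced here, but a live TCP migration must at minimum freeze and transfer per-connection transport state. A hypothetical sketch of the overall shape; all field and helper names (TcpConnState, buffer_flow, redirect_flow, and so on) are invented for illustration and are not Capybara's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class TcpConnState:
    four_tuple: tuple      # (src_ip, src_port, dst_ip, dst_port)
    snd_nxt: int           # next sequence number to send
    rcv_nxt: int           # next sequence number expected from the peer
    snd_wnd: int           # peer's advertised receive window
    unacked: bytes = b""   # sent-but-unacknowledged payload
    recv_buf: bytes = b""  # received-but-undelivered payload

def migrate(conn: TcpConnState, origin, target, switch):
    """Freeze, transfer, and resume one connection. The switch buffers
    packets that arrive mid-migration, then redirects the flow."""
    switch.buffer_flow(conn.four_tuple)            # transient in-switch buffering
    target.install(conn)                           # target rebuilds the state
    switch.redirect_flow(conn.four_tuple, target)  # migration-aware forwarding
    switch.replay_buffered(conn.four_tuple)        # drain buffered packets
    origin.drop(conn.four_tuple)                   # origin forgets the flow
```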
[4] |
Inho Choi, Ellis Michael, Yunfan Li, Dan R. K. Ports, and Jialin Li.
Hydra: Serialization-free network ordering for strongly consistent
distributed applications.
In Proceedings of the 20th USENIX Symposium on Networked
Systems Design and Implementation (NSDI '23), Boston, MA, USA, April
2023. USENIX.
[ bib |
.pdf ]
A large class of distributed systems, e.g., state machine replication and fault-tolerant distributed databases, rely on establishing a consistent order of operations on groups of nodes in the system. Traditionally, application-level distributed protocols such as Paxos and two-phase locking provide the ordering guarantees. To reduce the performance overhead imposed by these protocols, a recent line of work proposes to move the responsibility of ensuring operation ordering into the network -- by sequencing requests through a centralized network sequencer. This network sequencing approach yields significant application-level performance improvements, but routing all requests through a single sequencer comes with several fundamental limitations, including a sequencer scalability bottleneck, prolonged system downtime during sequencer failover, and worsened network-level load balancing. |
[5] |
Alberto Lerner, Carsten Binnig, Philippe Cudré-Mauroux, Rana Hussein,
Matthias Jasny, Theo Jepsen, Dan R. K. Ports, Lasse Thostrup, and Tobias
Ziegler.
Databases on modern networks: A decade of research that now comes
into practice.
In Proceedings of the 49th International Conference on
Very Large Data Bases (VLDB '23), August 2023.
Tutorial.
[ bib |
.pdf ]
Modern cloud networks are a fundamental pillar of data-intensive applications. They provide high-speed transaction (packet) rates and low overhead, enabling, for instance, truly scalable database designs. These networks, however, are fundamentally different from conventional ones. Arguably, the two key discerning technologies are RDMA and programmable network devices. Today, these technologies are not niche technologies anymore and are widely deployed across all major cloud vendors. The question is thus not if but how a new breed of data-intensive applications can benefit from modern networks, given the perceived difficulty in using and programming them. This tutorial addresses these challenges by exposing how the underlying principles changed as the network evolved and by presenting the new system design opportunities they opened. In the process, we also discuss several hard-earned lessons accumulated by making the transition first-hand. |
[6] |
Ziyuan Liu, Zhixiong Niu, Ran Shu, Liang Gao, Guohong Lai, Na Wang, Zongying
He, Jacob Nelson, Dan R. K. Ports, Lihua Yuan, Peng Cheng, and Yongqiang
Xiong.
SlimeMold: Hardware load balancer at scale in datacenter.
In Proceedings of the 7th Asia-Pacific Workshop on Networking
(APNet '23), Hong Kong, China, July 2023. ACM.
[ bib |
slides (.pdf) |
.pdf ]
Stateful load balancers (LBs) are essential services in cloud data centers, playing a crucial role in enhancing the availability and capacity of applications. Numerous studies have proposed methods to improve the throughput, connections per second, and concurrent flows of individual LBs. For instance, with the advancement of programmable switches, hardware-based load balancers (HLBs) have become mainstream due to their high efficiency. However, programmable switches still face the issue of limited registers and table entries, preventing them from fully meeting the performance requirements of data centers. In this paper, rather than solely focusing on enhancing individual HLBs, we introduce SlimeMold, which enables HLBs to work collaboratively at scale as an integrated LB system in data centers. |
[7] |
Yifan Yuan, Jinghan Huang, Yan Sun, Tianchen Wang, Jacob Nelson, Dan Ports,
Yipeng Wang, Ren Wang, Charlie Tai, and Nam Sung Kim.
RAMBDA: RDMA-driven acceleration framework for memory-intensive
μs-scale datacenter applications.
In Proceedings of the 29th International Symposium on High
Performance Computer Architecture (HPCA '23), Montreal, QC, Canada,
February 2023. IEEE.
[ bib |
.pdf ]
Responding to the "datacenter tax" and "killer microseconds" problems for memory-intensive datacenter applications, diverse solutions, including Smart NIC-based ones, have been proposed. Nonetheless, they often suffer from the high overhead of communication over network and/or PCIe links. To tackle the limitations of current solutions, this paper proposes RAMBDA, an RDMA-driven acceleration framework for boosting the performance of memory-intensive μs-scale datacenter applications. RAMBDA is a holistic network and architecture co-design that leverages current RDMA and emerging cache-coherent off-chip interconnect technologies and consists of the following four hardware and software components: (1) unified abstraction of inter- and intra-machine communications synergistically managed by one-sided RDMA write and cache-coherent memory write; (2) efficient notification of requests to accelerators assisted by cache coherence; (3) cache-coherent accelerator architecture directly interacting with the NIC; and (4) adaptive device-to-host data transfer for modern server memory systems comprising both DRAM and NVM, exploiting state-of-the-art features in CPUs and PCIe. We prototype RAMBDA with a commercial system and evaluate three popular datacenter applications: (1) in-memory key-value store, (2) chain replication-based distributed transaction system, and (3) deep learning recommendation model inference. The evaluation shows that RAMBDA provides 30.1∼69.1 lower latency, up to 2.5x higher throughput, and 3x higher energy efficiency than the current state-of-the-art solutions. |
[8] |
Lior Zeno, Dan R. K. Ports, Jacob Nelson, Daehyeok Kim, Shir Landau Feibish,
Idit Keidar, Arik Rinberg, Alon Rashelbach, Igor De-Paula, and Mark
Silberstein.
SwiSh: Distributed shared state abstractions for programmable
switches.
In Proceedings of the 16th ACM International System and
Storage Conference (SYSTOR '23), Haifa, Israel, July 2023. ACM.
Highlights Session.
[ bib ]
We design and evaluate SwiShmem, a distributed shared state management layer for data-plane P4 programs. SwiShmem enables running scalable stateful distributed network functions on programmable switches entirely in the data-plane. We explore several schemes to build a shared variable abstraction, which differ in consistency, performance, and in-switch implementation complexity. |
[9] |
Ziyuan Liu, Zhixiong Niu, Ran Shu, Wenxue Cheng, Peng Cheng, Yongqiang Xiong,
Lihua Yuan, Jacob Nelson, and Dan R. K. Ports.
A disaggregate data collecting approach for loss-tolerant
applications.
In Proceedings of the 6th Asia-Pacific Workshop on Networking
(APNet '22), Fuzhou, China, July 2022. ACM.
[ bib |
.pdf ]
Datacenters generate operational data at an extremely high rate, and datacenter operators collect and analyze it for problem diagnosis, resource utilization improvement, and performance optimization. However, existing data collection methods fail to efficiently aggregate and store data at extremely high speed and scale. In this paper, we explore a new approach that leverages programmable switches to aggregate data and directly write it to the destination storage. Our proposed data collection system, ALT, uses programmable switches to control NVMe SSDs on remote hosts without involving a remote CPU. To tolerate loss, ALT uses an elegant data structure that enables efficient data recovery when retrieving the collected data. We implement a prototype of our system on a Tofino-based programmable switch. Our evaluation shows that ALT can saturate the SSD's peak performance without any CPU involvement. |
[10] |
Yifan Yuan, Omar Alama, Jiawei Fei, Jacob Nelson, Dan R. K. Ports, Amedeo
Sapio, Marco Canini, and Nam Sung Kim.
Unlocking the power of inline floating-point operations on
programmable switches.
In Proceedings of the 19th USENIX Symposium on Networked
Systems Design and Implementation (NSDI '22), Renton, WA, USA, April
2022. USENIX.
[ bib |
slides (.pdf) |
.pdf ]
The advent of switches with programmable dataplanes has enabled the rapid development of new network functionality, as well as providing a platform for acceleration of a broad range of application-level functionality. However, existing switch hardware was not designed with application acceleration in mind, and thus applications requiring operations or datatypes not used in traditional network protocols must resort to expensive workarounds. Applications involving floating point data, including distributed training for machine learning and distributed query processing, are key examples. |
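To see why this matters, consider the conventional host-side workaround that integer-only switch pipelines force: quantize floats to fixed-point before aggregation. A small worked example (the scale factor is an arbitrary illustrative choice, not from the paper):

```python
import numpy as np

# Integer-only switch ALUs cannot add floats directly. A common host-side
# workaround -- the baseline that in-switch floating point avoids -- is to
# scale floats into integers, aggregate with integer adds, then rescale.

SCALE = 1 << 20                      # fixed-point granularity of 2^-20

def to_fixed(x: np.ndarray) -> np.ndarray:
    return np.round(x * SCALE).astype(np.int64)

def from_fixed(x: np.ndarray) -> np.ndarray:
    return x.astype(np.float64) / SCALE

workers = [np.random.randn(4) for _ in range(8)]   # 8 workers' gradient shards
switch_sum = sum(to_fixed(g) for g in workers)     # integer adds only
approx = from_fixed(switch_sum)
exact = sum(workers)
print(np.max(np.abs(approx - exact)))              # small quantization error
```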
[11] |
Lior Zeno, Dan R. K. Ports, Jacob Nelson, Daehyeok Kim, Shir Landau Feibish,
Idit Keidar, Arik Rinberg, Alon Rashelbach, Igor De-Paula, and Mark
Silberstein.
SwiSh: Distributed shared state abstractions for programmable
switches.
In Proceedings of the 19th USENIX Symposium on Networked
Systems Design and Implementation (NSDI '22), Renton, WA, USA, April
2022. USENIX.
[ bib |
slides (.pdf) |
.pdf ]
We design and evaluate SwiShmem, a distributed shared state management layer for data-plane P4 programs. SwiShmem enables running scalable stateful distributed network functions on programmable switches entirely in the data-plane. We explore several schemes to build a shared variable abstraction, which differ in consistency, performance, and in-switch implementation complexity. |
[12] |
Hang Zhu, Tao Wang, Yi Hong, Dan R. K. Ports, Anirudh Sivaraman, and Xin Jin.
NetVRM: Virtual register memory for programmable networks.
In Proceedings of the 19th USENIX Symposium on Networked
Systems Design and Implementation (NSDI '22), Renton, WA, USA, April
2022. USENIX.
[ bib |
slides (.pdf) |
.pdf ]
Programmable networks are enabling a new class of applications that leverage the line-rate processing capability and on-chip register memory of the switch data plane. Yet the status quo is focused on developing approaches that share the register memory statically. We present NetVRM, a network management system that supports dynamic register memory sharing between multiple concurrent applications on a programmable network and is readily deployable on commodity programmable switches. NetVRM provides a virtual register memory abstraction that enables applications to share the register memory in the data plane, and abstracts away the underlying details. In principle, NetVRM supports any memory allocation algorithm given the virtual register memory abstraction. It also provides a default memory allocation algorithm that exploits the observation that applications have diminishing returns on additional memory. NetVRM provides an extension of P4, P4VRM, for developing applications with virtual register memory, and a compiler to generate data plane programs and control plane APIs. Testbed experiments show that NetVRM is general to a diverse variety of applications, and that its utility-based dynamic allocation policy outperforms static resource allocation. Specifically, it improves the mean satisfaction ratio (i.e., the fraction of a network application’s lifetime that it meets its utility target) by 1.6-2.2x under a range of workloads. |
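A minimal sketch of the diminishing-returns idea behind the default allocator: greedily hand each memory unit to the application with the highest marginal utility. The utility curves below are illustrative stand-ins, not NetVRM's.

```python
import math

def allocate(pages_total, utility_fns):
    """utility_fns: app -> f(pages) returning utility, assumed concave
    (diminishing returns). Greedy marginal allocation of each page."""
    alloc = {app: 0 for app in utility_fns}
    for _ in range(pages_total):
        # give the next page to the app that gains the most from it
        best = max(alloc, key=lambda a: utility_fns[a](alloc[a] + 1)
                                        - utility_fns[a](alloc[a]))
        alloc[best] += 1
    return alloc

apps = {
    "heavy_hitter": lambda p: 1 - 2 ** -p,     # utility saturates quickly
    "sketch":       lambda p: math.log1p(p),   # keeps improving slowly
}
print(allocate(16, apps))   # most pages flow to the slower-saturating app
```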
[13] |
Matthew Burke, Sowmya Dharanipragada, Shannon Joyner, Adriana Szekeres, Jacob
Nelson, Irene Zhang, and Dan R. K. Ports.
PRISM: Rethinking the RDMA interface for distributed systems.
In Proceedings of the 28th ACM Symposium on Operating
Systems Principles (SOSP '21), Virtual Conference, October 2021.
ACM.
[ bib |
.pdf ]
Remote Direct Memory Access (RDMA) has been used to accelerate a variety of distributed systems, by providing low-latency, CPU-bypassing access to a remote host's memory. However, most of the distributed protocols used in these systems cannot easily be expressed in terms of the simple memory READs and WRITEs provided by RDMA. As a result, designers face a choice between introducing additional protocol complexity (e.g., additional round trips) or forgoing the benefits of RDMA entirely. |
[14] |
Daehyeok Kim, Jacob Nelson, Dan R. K. Ports, Vyas Sekar, and Srinivasan Seshan.
RedPlane: Enabling fault tolerant stateful in-switch applications.
In Proceedings of ACM SIGCOMM 2021, Virtual Conference,
August 2021. ACM.
[ bib |
.pdf ]
Many recent efforts have demonstrated the performance benefits of running datacenter functions (e.g., NATs, load balancers, monitoring) on programmable switches. However, a key missing piece remains: fault tolerance. This is especially critical as the network is no longer stateless and pure endpoint recovery does not suffice. In this paper, we design and implement RedPlane, a fault-tolerant state store for stateful in-switch applications. This provides in-switch applications consistent access to their state, even if the switch they run on fails or traffic is rerouted to an alternative switch. We address key challenges in devising a practical, provably correct replication protocol and implementing it in the switch data plane. Our evaluations show that RedPlane incurs negligible overhead and enables end-to-end applications to rapidly recover from switch failures. |
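Heavily simplified, the shape of the approach is to mirror each state update to an external store so a replacement switch can rebuild the state. The sketch below is illustrative (a toy per-flow byte counter stands in for the NF); it is not RedPlane's actual replication protocol, which is implemented in the switch data plane.

```python
class StateStore:                        # runs on ordinary servers
    def __init__(self):
        self.flows = {}
    def put(self, flow, state):
        self.flows[flow] = state
    def get(self, flow):
        return self.flows.get(flow)

class Switch:
    def __init__(self, store):
        self.store = store
        self.local = {}                  # in-switch register state
    def process(self, flow, pkt):
        state = self.local.get(flow, 0) + len(pkt)
        self.store.put(flow, state)      # replicate before effects are visible
        self.local[flow] = state
    def recover(self, flow):             # failover path on a replacement switch
        self.local[flow] = self.store.get(flow) or 0

store = StateStore()
s1 = Switch(store)
s1.process("flowA", b"x" * 100)
s2 = Switch(store)                       # traffic rerouted after s1 fails
s2.recover("flowA")                      # resumes with consistent state
```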
[15] |
Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon
Kim, Arvind Krishnamurthy, Masoud Moshref, Dan R. K. Ports, and Peter
Richtarik.
Scaling distributed machine learning with in-network aggregation.
In Proceedings of the 18th USENIX Symposium on Networked
Systems Design and Implementation (NSDI '21), Boston, MA, USA, April
2021. USENIX.
[ bib |
slides (.pdf) |
.pdf ]
Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide an efficient solution that speeds up training by up to 5.5x for a number of real-world benchmark models. |
[16] |
Huaicheng Li, Mingzhe Hao, Stanko Novakovic, Vaibhav Gogte, Sriram Govindan,
Dan R. K. Ports, Irene Zhang, Ricardo Bianchini, Haryadi S. Gunawi, and
Anirudh Badam.
LeapIO: Efficient and portable virtual NVMe storage on ARM
SoCs.
In Proceedings of the 24th International Conference on
Architectural Support for Programming Languages and Operating
Systems (ASPLOS '20), Lausanne, Switzerland, April 2020. ACM.
[ bib |
.pdf ]
Today's cloud storage stack is extremely resource hungry, burning 10-20% of datacenter x86 cores, a major "storage tax" that cloud providers must pay. Yet, the complex cloud storage stack is not completely offload-ready to today's IO accelerators. We present LeapIO, a new cloud storage stack that leverages ARM-based co-processors to offload complex storage services. LeapIO addresses many deployment challenges, such as hardware fungibility, software portability, virtualizability, composability, and efficiency. It uses a set of OS/software techniques and new hardware properties that provide a uniform address space across the x86 and ARM cores and expose virtual NVMe storage to unmodified guest VMs, at a performance that is competitive with bare-metal servers. |
[17] |
Jialin Li, Jacob Nelson, Ellis Michael, Xin Jin, and Dan R. K. Ports.
Pegasus: Tolerating skewed workloads in distributed storage with
in-network coherence directories.
In Proceedings of the 14th USENIX Symposium on Operating
Systems Design and Implementation (OSDI '20), Banff, AL, Canada,
November 2020. USENIX.
[ bib |
slides (.pdf) |
.pdf ]
High performance distributed storage systems face the challenge of load imbalance caused by skewed and dynamic workloads. This paper introduces Pegasus, a new storage system that leverages new-generation programmable switch ASICs to balance load across storage servers. Pegasus uses selective replication of the most popular objects in the data store to distribute load. Using a novel in-network coherence directory, the Pegasus switch tracks and manages the location of replicated objects. This allows it to achieve load-aware forwarding and dynamic rebalancing for replicated keys, while still guaranteeing data coherence and consistency. The Pegasus design is practical to implement as it stores only forwarding metadata in the switch data plane. The resulting system improves the throughput of a distributed in-memory key-value store by more than 10x under a latency SLO -- results which hold across a large set of workloads with varying degrees of skew, read/write ratio, object sizes, and dynamism. |
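A toy model of the coherence-directory idea (illustrative only; the real directory lives in switch hardware and stores just forwarding metadata):

```python
class CoherenceDirectory:
    def __init__(self, servers):
        self.servers = servers
        self.replicas = {}                    # hot key -> servers holding it
        self.load = {s: 0 for s in servers}

    def _home(self, key):
        return self.servers[hash(key) % len(self.servers)]

    def route_read(self, key):
        # Load-aware forwarding: any coherent replica may serve the read.
        candidates = self.replicas.get(key, [self._home(key)])
        target = min(candidates, key=self.load.__getitem__)
        self.load[target] += 1
        return target

    def route_write(self, key):
        # A write shrinks the replica set to one server, keeping all copies
        # of the key coherent (simplified policy for illustration).
        target = min(self.replicas.get(key, [self._home(key)]),
                     key=self.load.__getitem__)
        self.replicas[key] = [target]
        self.load[target] += 1
        return target

    def promote(self, key, extra_servers):
        # Selective replication of the most popular objects.
        current = self.replicas.get(key, [self._home(key)])
        self.replicas[key] = list({*current, *extra_servers})

d = CoherenceDirectory(["s0", "s1", "s2", "s3"])
d.promote("hotkey", ["s1", "s2"])
print(d.route_read("hotkey"))   # least-loaded replica of the hot key
```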
[18] |
Adriana Szekeres, Michael Whittaker, Naveen Kr. Sharma, Jialin Li, Arvind
Krishnamurthy, Irene Zhang, and Dan R. K. Ports.
Meerkat: Scalable replicated transactions following the
zero-coordination principle.
In Proceedings of the 15th ACM SIGOPS EuroSys (EuroSys
'20), Heraklion, Crete, Greece, April 2020. ACM.
[ bib |
.pdf ]
Traditionally, the high cost of network communication between servers has hidden the impact of cross-core coordination in replicated systems. However, new technologies, like kernel-bypass networking and faster network links, have exposed hidden bottlenecks in distributed systems. |
[19] |
Tao Wang, Hang Zhu, Fabian Ruffy, Xin Jin, Anirudh Sivaraman, Dan R. K. Ports,
and Aurojit Panda.
Multitenancy for fast and programmable networks in the cloud.
In Proceedings of the 11th Hot Topics in Cloud Computing
(HotCloud '20), Boston, MA, USA, July 2020. USENIX.
[ bib |
.pdf ]
Fast and programmable network devices are now readily available, both in the form of programmable switches and smart network-interface cards. Going forward, we envision that these devices will be widely deployed in the networks of cloud providers (e.g., AWS, Azure, and GCP) and exposed as a programmable surface for cloud customers -- similar to how cloud customers can today rent CPUs, GPUs, FPGAs, and ML accelerators. Making this vision a reality requires us to develop a mechanism to share the resources of a programmable network device across multiple cloud tenants. In other words, we need to provide multitenancy on these devices. In this position paper, we design compile- and run-time approaches to multitenancy. We present preliminary results showing that our design provides both efficient utilization of the resources of a programmable network device and isolation of tenant programs from each other. |
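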
[20] |
Lior Zeno, Dan R. K. Ports, Jacob Nelson, and Mark Silberstein.
SwiShmem: Distributed shared state abstractions for programmable
switches.
In Proceedings of the 16th Workshop on Hot Topics in Networks
(HotNets '20), Chicago, IL, USA, November 2020. ACM.
[ bib |
.pdf ]
Programmable switches provide an appealing platform for running network functions (NFs), such as NATs, firewalls and DDoS detectors, entirely in the data plane, at staggering multi-Tbps processing rates. However, to be used in real deployments with a complex multi-switch topology, one NF instance must be deployed on each switch, which together act as a single logical NF. This requirement poses significant challenges, in particular for stateful NFs, due to the need to manage shared NF state across the switches. |
[21] |
Dan R. K. Ports and Jacob Nelson.
When should the network be the computer?
In Proceedings of the 17th Workshop on Hot Topics in Operating
Systems (HotOS '19), Bertinoro, Italy, May 2019. ACM.
[ bib |
slides (.pdf) |
.pdf ]
Researchers have repurposed programmable network devices to place small amounts of application computation in the network, sometimes yielding orders-of-magnitude performance gains. At the same time, effectively using these devices requires careful use of limited resources and managing deployment challenges. |
[22] |
Ellis Michael and Dan R. K. Ports.
Towards causal datacenter networks.
In Proceedings of the 2018 Workshop on Principles and Practice
of Consistency for Distributed Data (PaPoC '18), Porto, Portugal, April
2018. ACM.
[ bib |
.pdf ]
Traditionally, distributed systems conservatively assume an asynchronous network. However, recent work on the co-design of networks and distributed systems has shown that stronger ordering properties are achievable in datacenter networks and yield performance improvements for the distributed systems they support. We build on that trend and ask whether it is possible for the datacenter network to order all messages in a protocol-agnostic way. This approach, which we call omnisequencing, would ensure causal delivery of all messages, making consistency a network-level guarantee. |
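For reference, the end-host mechanism that omnisequencing would subsume: causal delivery via vector clocks, where a message is buffered until all of its causal dependencies have been delivered. A compact sketch with illustrative names:

```python
class CausalNode:
    def __init__(self, me, n):
        self.me, self.vc = me, [0] * n   # this node's id and vector clock
        self.pending = []                # messages buffered for causality

    def send(self):
        self.vc[self.me] += 1
        return (self.me, list(self.vc))  # broadcast (sender, clock)

    def _deliverable(self, sender, clock):
        # next-in-order from sender, and no missing causal dependencies
        return (clock[sender] == self.vc[sender] + 1 and
                all(clock[i] <= self.vc[i] for i in range(len(clock))
                    if i != sender))

    def receive(self, msg):
        self.pending.append(msg)
        delivered = True
        while delivered:                 # drain everything now deliverable
            delivered = False
            for sender, clock in list(self.pending):
                if self._deliverable(sender, clock):
                    self.vc[sender] = clock[sender]
                    self.pending.remove((sender, clock))
                    delivered = True     # deliver to the application here

a, b, c = CausalNode(0, 3), CausalNode(1, 3), CausalNode(2, 3)
m1 = a.send()
m2 = a.send()        # causally after m1
c.receive(m2)        # arrives early: buffered
c.receive(m1)        # now both deliver, in causal order
print(c.vc)          # [2, 0, 0]
```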
[23] |
Helga Gudmundsdottir, Babak Salimi, Magdalena Balazinska, Dan R. K. Ports, and
Dan Suciu.
A demonstration of interactive analysis of performance measurements
with Viska.
In Proceedings of the 2017 ACM SIGMOD International
Conference on Management of Data, Chicago, IL, USA, May 2017. ACM.
Demonstration.
[ bib |
.pdf ]
The ultimate goal of system performance analysis is to identify the underlying causes for performance differences between different systems and different workloads. We make it easier to achieve this goal with Viska, a new tool for generating and interpreting performance measurement results. Viska leverages cutting-edge techniques from big data analytics and data visualization to aid and automate this analysis, helping users derive meaningful and statistically sound conclusions using state-of-the-art causal inference and hypothesis testing techniques. |
[24] |
Jialin Li, Ellis Michael, and Dan R. K. Ports.
Eris: Coordination-free consistent transactions using network
multi-sequencing.
In Proceedings of the 26th ACM Symposium on Operating
Systems Principles (SOSP '17), Shanghai, China, October 2017. ACM.
[ bib |
.pdf ]
Distributed storage systems aim to provide strong consistency and isolation guarantees on an architecture that is partitioned across multiple shards for scalability and replicated for fault-tolerance. Traditionally, achieving all of these goals has required an expensive combination of atomic commitment and replication protocols -- introducing extensive coordination overhead. Our system, Eris, takes a very different approach. It moves a core piece of concurrency control functionality, which we term multi-sequencing, into the datacenter network itself. This network primitive takes on the responsibility for consistently ordering transactions, and a new lightweight transaction protocol ensures atomicity. The end result is that Eris avoids both replication and transaction coordination overhead: we show that it can process a large class of distributed transactions in a single round-trip from the client to the storage system without any explicit coordination between shards or replicas. It provides atomicity, consistency, and fault-tolerance with less than 10% overhead -- achieving throughput 4.5--35x higher and latency 72--80% lower than a conventional design on standard benchmarks. |
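A minimal sketch of the multi-sequencing primitive: one stamp per destination shard, so each shard can independently order its transactions and detect drops without shard-to-shard coordination. (Illustrative Python; in Eris this runs in the datacenter network.)

```python
class MultiSequencer:
    def __init__(self):
        self.counters = {}                  # shard -> last sequence number

    def stamp(self, txn, shards):
        stamps = {}
        for s in sorted(shards):            # atomically bump each shard counter
            self.counters[s] = self.counters.get(s, 0) + 1
            stamps[s] = self.counters[s]
        return {"txn": txn, "stamps": stamps}

class Shard:
    def __init__(self, name):
        self.name, self.expected = name, 1

    def on_txn(self, msg):
        seq = msg["stamps"][self.name]
        if seq != self.expected:            # gap: a txn for this shard was lost
            raise RuntimeError(f"missing transaction(s) before seq {seq}")
        self.expected += 1
        return msg["txn"]                   # execute in sequence order

seq = MultiSequencer()
s1, s2 = Shard("A"), Shard("B")
t = seq.stamp("txn1", ["A", "B"])           # multi-shard transaction
s1.on_txn(t); s2.on_txn(t)                  # both shards agree on the order
```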
[25] |
Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres.
Recovering shared objects without stable storage.
In Proceedings of the 31st International Symposium on
Distributed Computing (DISC '17), Vienna, Austria, October 2017.
[ bib |
.pdf ]
This paper considers the problem of building fault-tolerant shared objects when processes can crash and recover but lose their persistent state on recovery. This Diskless Crash-Recovery (DCR) model matches the way many long-lived systems are built. We show that it presents new challenges, as operations that are recorded at a quorum may not persist after some of the processes in that quorum crash and then recover. |
[26] |
Babak Salimi, Corey Cole, Dan R. K. Ports, and Dan Suciu.
Zaliql: Causal inference from observational data at scale.
In Proceedings of the 43rd International Conference on
Very Large Data Bases (VLDB '17), August 2017.
Demonstration.
[ bib |
.pdf ]
Causal inference from observational data is a subject of active research and development in statistics and computer science. Many statistical software packages have been developed for this purpose. However, these toolkits do not scale to large datasets. We propose and demonstrate ZaliQL: a SQL-based framework for drawing causal inference from observational data. ZaliQL supports the state-of-the-art methods for causal inference and runs at scale within PostgreSQL database system. In addition, we built a visual interface to wrap around ZaliQL. In our demonstration, we will use this GUI to show a live investigation of the causal effect of different weather conditions on flight delays. |
[27] |
Helga Gudmundsdottir, Babak Salimi, Magdalena Balazinska, Dan R. K. Ports, and
Dan Suciu.
Viska: Enabling interactive analysis of performance measurements.
In Proceedings of the 12th USENIX Symposium on Operating
Systems Design and Implementation (OSDI '16), Savannah, GA, USA,
November 2016. USENIX.
Poster.
[ bib ]
Much of systems research consists of performance analysis -- to learn when one system outperforms another, to identify architectural choices responsible for the difference, or to identify performance anomalies in particular workloads, for example. However, despite recent advances in data analytics and interactive data visualization, the tools we use for performance analysis remain remarkably primitive. |
[28] |
Brandon Holt, James Bornholt, Irene Zhang, Dan R. K. Ports, Mark Oskin, and
Luis Ceze.
Disciplined inconsistency with consistency types.
In Proceedings of the 7th Symposium on Cloud Computing (SOCC
'16), Santa Clara, CA, USA, October 2016. ACM.
[ bib |
.pdf ]
Distributed applications and web services, such as online stores or social networks, are expected to be scalable, available, responsive, and fault-tolerant. To meet these steep requirements in the face of high round-trip latencies, network partitions, server failures, and load spikes, applications use eventually consistent datastores that allow them to weaken the consistency of some data. However, making this transition is highly error-prone because relaxed consistency models are notoriously difficult to understand and test. |
[29] |
Jialin Li, Ellis Michael, Adriana Szekeres, Naveen Kr. Sharma, and Dan R. K.
Ports.
Just say NO to Paxos overhead: Replacing consensus with network
ordering.
In Proceedings of the 12th USENIX Symposium on Operating
Systems Design and Implementation (OSDI '16), Savannah, GA, USA,
November 2016. USENIX.
[ bib |
.pdf ]
Distributed applications use replication, implemented by protocols like Paxos, to ensure data availability and transparently mask server failures. This paper presents a new approach to achieving replication in the data center without the performance cost of traditional methods. Our work carefully divides replication responsibility between the network and protocol layers. The network orders requests but does not ensure reliable delivery -- using a new primitive we call ordered unreliable multicast (OUM). Implementing this primitive can be achieved with near-zero-cost in the data center. Our new replication protocol, Network-Ordered Paxos (NOPaxos), exploits network ordering to provide strongly consistent replication without coordination. The resulting system not only outperforms both latency- and throughput-optimized protocols on their respective metrics, but also yields throughput within 2% and latency within 16 μs of an unreplicated system -- providing replication without the performance cost. |
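The OUM contract in miniature: the network stamps a single consecutive sequence number on each multicast, so replicas get ordering for free and packet loss becomes an explicit, detectable gap. A sketch with illustrative names:

```python
class Sequencer:                     # in NOPaxos, this role is in the network
    def __init__(self):
        self.seq = 0
    def forward(self, msg):
        self.seq += 1
        return (self.seq, msg)       # stamp, then multicast to all replicas

class Replica:
    def __init__(self):
        self.next_seq = 1
        self.log = []
    def deliver(self, stamped):
        seq, msg = stamped
        if seq > self.next_seq:      # a stamped message never arrived
            self.on_gap(self.next_seq, seq)
        self.log.append(msg)
        self.next_seq = seq + 1
    def on_gap(self, missing_from, got):
        # NOPaxos resolves gaps with a coordination round among replicas;
        # this sketch just records that a gap was detected.
        self.log.extend(["GAP"] * (got - missing_from))
```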
[30] |
Brandon Holt, Irene Zhang, Dan R. K. Ports, Mark Oskin, and Luis Ceze.
Claret: Using data types for highly concurrent distributed
transactions.
In Proceedings of the 2015 Workshop on Principles and Practice
of Consistency for Distributed Data (PaPoC '15), Bordeaux, France, April
2015. ACM.
[ bib |
.pdf ]
Out of the many NoSQL databases in use today, some that provide simple data structures for records, such as Redis and MongoDB, are now becoming popular. Building applications out of these complex data types provides a way to communicate intent to the database system without sacrificing flexibility or committing to a fixed schema. Currently this capability is leveraged in limited ways, such as to ensure related values are co-located, or for atomic updates. There are many ways data types can be used to make databases more efficient that are not yet being exploited. |
[31] | Brandon Holt, Irene Zhang, Dan R. K. Ports, Mark Oskin, and Luis Ceze. Claret: Using data types for highly concurrent distributed transactions. In Proceedings of the 10th ACM SIGOPS EuroSys (EuroSys '15), Bordeaux, France, April 2015. ACM. Poster. [ bib ] |
[32] |
Dan R. K. Ports, Jialin Li, Vincent Liu, Naveen Kr. Sharma, and Arvind
Krishnamurthy.
Designing distributed systems using approximate synchrony in
datacenter networks.
In Proceedings of the 12th USENIX Symposium on Networked
Systems Design and Implementation (NSDI '15), Oakland, CA, USA, May
2015. USENIX.
[ bib |
slides (.pdf) |
.pdf ]
Distributed systems are traditionally designed independently from the underlying network, making worst-case assumptions (e.g., complete asynchrony) about its behavior. However, many of today's distributed applications are deployed in data centers, where the network is more reliable, predictable, and extensible. In these environments, it is possible to co-design distributed systems with their network layer, and doing so can offer substantial benefits. |
[33] | Naveen Kr. Sharma, Brandon Holt, Irene Zhang, Dan R. K. Ports, and Marcos Aguilera. Transtorm: a benchmark suite for transactional key-value storage systems. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15), Monterey, CA, USA, October 2015. ACM. Poster. [ bib ] |
[34] |
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan
R. K. Ports.
Building consistent transactions with inconsistent replication.
In Proceedings of the 25th ACM Symposium on Operating
Systems Principles (SOSP '15), Monterey, CA, USA, October 2015. ACM.
[ bib |
.pdf ]
Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google's Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use -- in part because they require costly replication protocols, like Paxos, for fault tolerance. In this paper, we present a new approach that makes transactional storage systems more affordable: we eliminate consistency from the replication protocol while still providing distributed transactions with strong consistency to applications. |
[35] |
Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble.
Tales of the tail: Hardware, OS, and application-level sources of
tail latency.
In Proceedings of the 5th Symposium on Cloud Computing (SOCC
'14), Seattle, WA, USA, November 2014. ACM.
[ bib |
.pdf ]
Interactive services often have large-scale parallel implementations. To deliver fast responses, the median and tail latencies of a service's components must be low. In this paper, we explore the hardware, OS, and application-level sources of poor tail latency in high throughput servers executing on multi-core machines. |
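A quick synthetic illustration of the metrics at stake -- for a right-skewed latency distribution, the tail moves long before the median does:

```python
import random

random.seed(1)
# Synthetic right-skewed service latencies, ~100 us mean (illustrative only).
lat = sorted(random.expovariate(1 / 100) for _ in range(100_000))
p = lambda q: lat[int(q * (len(lat) - 1))]
print(f"median={p(0.50):.0f}us  p99={p(0.99):.0f}us  p999={p(0.999):.0f}us")
# A single slow component inflates p99/p999 long before it moves the median.
```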
[36] |
Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind
Krishnamurthy, Thomas Anderson, and Timothy Roscoe.
Arrakis: The operating system is the control plane.
In Proceedings of the 11th USENIX Symposium on Operating
Systems Design and Implementation (OSDI '14), Broomfield, CO, USA,
October 2014. USENIX.
[ bib |
.pdf ]
Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications, to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing improvements of 2-5x in latency and 9x in throughput for a popular persistent NoSQL store relative to a well-tuned Linux implementation. |
[37] |
Simon Peter, Jialin Li, Doug Woos, Irene Zhang, Dan R. K. Ports, Thomas
Anderson, Arvind Krishnamurthy, and Mark Zbikowski.
Towards high-performance application-level storage management.
In Proceedings of the 5th Hot Topics in Storage and File Systems
(HotStorage '14), Philadelphia, PA, USA, June 2014. USENIX.
[ bib |
.pdf ]
We propose a radical re-architecture of the traditional operating system storage stack to move the kernel off the data path. Leveraging virtualized I/O hardware for disk and flash storage, most read and write I/O operations go directly to application code. The kernel dynamically allocates extents, manages the virtual to physical binding, and performs name translation. The benefit is to dramatically reduce the CPU overhead of storage operations while improving application flexibility. |
[38] | Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports. Optimistic replicated two-phase commit. In Proceedings of the 5th Asia-Pacific Workshop on Systems (APSYS '14), Beijing, China, June 2014. Poster and extended abstract. [ bib ] |
[39] |
Danyang Zhuo, Qiao Zhang, Dan R. K. Ports, Arvind Krishnamurthy, and Thomas
Anderson.
Machine fault tolerance for reliable datacenter systems.
In Proceedings of the 5th Asia-Pacific Workshop on Systems
(APSYS '14), Beijing, China, June 2014.
[ bib |
.pdf ]
Although rare in absolute terms, undetected CPU, memory, and disk errors occur often enough at data center scale to significantly affect overall system reliability and availability. In this paper, we propose a new failure model, called Machine Fault Tolerance, and a new abstraction, a replicated write-once trusted table, to provide improved resilience to these types of failures. Since most machine failures manifest in application server and operating system code, we assume a Byzantine model for those parts of the system. However, by assuming that the hypervisor and network are trustworthy, we are able to reduce the overhead of machine-fault masking to be close to that of non-Byzantine Paxos. |
[40] |
Winnie Cheng, Dan R. K. Ports, David Schultz, Victoria Popic, Aaron Blankstein,
James Cowling, Dorothy Curtis, Liuba Shrira, and Barbara Liskov.
Abstractions for usable information flow control in Aeolus.
In Proceedings of the 2012 USENIX Annual Technical
Conference, Boston, MA, USA, June 2012. USENIX.
[ bib |
slides (.pdf) |
.pdf ]
Despite the increasing importance of protecting confidential data, building secure software remains as challenging as ever. This paper describes Aeolus, a new platform for building secure distributed applications. Aeolus uses information flow control to provide confidentiality and data integrity. It differs from previous information flow control systems in a way that we believe makes it easier to understand and use. Aeolus uses a new, simpler security model, the first to combine a standard principal-based scheme for authority management with thread-granularity information flow tracking. The principal hierarchy matches the way developers already reason about authority and access control, and the coarse-grained information flow tracking eases the task of defining a program's security restrictions. In addition, Aeolus provides a number of new mechanisms (authority closures, compound tags, boxes, and shared volatile state) that support common design patterns in secure application design. |
[41] |
Dan R. K. Ports, Austin T. Clements, Irene Zhang, Samuel Madden, and Barbara
Liskov.
Transactional consistency and automatic management in an application
data cache.
In Proceedings of the 9th USENIX Symposium on Operating
Systems Design and Implementation (OSDI '10), Vancouver, BC, Canada,
October 2010. USENIX.
[ bib |
slides (.pdf) |
.ps.gz |
.pdf ]
Distributed in-memory application data caches like memcached are a popular solution for scaling database-driven web sites. These systems are easy to add to existing deployments, and increase performance significantly by reducing load on both the database and application servers. Unfortunately, such caches do not integrate well with the database or the application. They cannot maintain transactional consistency across the entire system, violating the isolation properties of the underlying database. They leave the application responsible for locating data in the cache and keeping it up to date, a frequent source of application complexity and programming errors. |
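A sketch in the spirit of this design: cache entries carry validity intervals of database timestamps, and a transaction pinned to a snapshot may only use entries valid at that snapshot. (Simplified; names are illustrative.)

```python
class TxCache:
    def __init__(self):
        self.entries = {}          # key -> list of (lo, hi, value) intervals

    def put(self, key, value, lo, hi):
        # Cache a computed value together with the range of database
        # timestamps over which it is known to be valid.
        self.entries.setdefault(key, []).append((lo, hi, value))

    def get(self, key, snapshot_ts):
        for lo, hi, value in self.entries.get(key, []):
            if lo <= snapshot_ts <= hi:
                return value       # consistent with the transaction's snapshot
        return None                # miss: recompute at snapshot_ts, then put()

cache = TxCache()
cache.put("top10", ["a", "b"], lo=100, hi=120)   # valid for ts in [100, 120]
assert cache.get("top10", 110) == ["a", "b"]
assert cache.get("top10", 130) is None           # stale for a newer snapshot
```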
[42] |
James Cowling, Dan R. K. Ports, Barbara Liskov, Raluca Ada Popa, and Abhijeet
Gaikwad.
Census: Location-aware membership management for large-scale
distributed systems.
In Proceedings of the 2009 USENIX Annual Technical
Conference, San Diego, CA, USA, June 2009. USENIX.
[ bib |
slides (.pdf) |
.ps.gz |
.pdf ]
We present Census, a platform for building large-scale distributed applications. Census provides a membership service and a multicast mechanism. The membership service provides every node with a consistent view of the system membership, which may be global or partitioned into location-based regions. Census distributes membership updates with low overhead, propagates changes promptly, and is resilient to both crashes and Byzantine failures. We believe that Census is the first system to provide a consistent membership abstraction at very large scale, greatly simplifying the design of applications built atop large deployments such as multi-site data centers. |
[43] |
Dan R. K. Ports, Austin T. Clements, Irene Y. Zhang, Samuel Madden, and Barbara
Liskov.
Transactional caching of application data using recent snapshots.
In Proceedings of the 22nd ACM Symposium on Operating
Systems Principles (SOSP '09), Big Sky, MT, USA, October 2009. ACM.
Work in Progress report.
[ bib |
slides (.pdf) |
.ps.gz |
.pdf ]
Many of today's well-known websites use application data caches to reduce the bottleneck load on the database, as well as the computational load on the application servers. Distributed in-memory shared caches, exemplified by memcached, are one popular approach. These caches typically provide a get/put interface, akin to a distributed hash table; the application chooses what data to keep in the cache and keeps it up to date. By storing the cache entirely in memory and horizontally partitioning among nodes, in-memory caches provide quick response times and ease of scaling. |
[44] |
Xiaoxin Chen, Tal Garfinkel, E. Christopher Lewis, Pratap Subrahmanyam, Carl A.
Waldspurger, Dan Boneh, Jeffrey Dwoskin, and Dan R. K. Ports.
Overshadow: A virtualization-based approach to retrofitting
protection in commodity operating systems.
In Proceedings of the 13th International Conference on
Architectural Support for Programming Languages and Operating
Systems (ASPLOS '08), Seattle, WA, USA, March 2008. ACM.
[ bib |
.ps.gz |
.pdf ]
Commodity operating systems entrusted with securing sensitive data are remarkably large and complex, and consequently, frequently prone to compromise. To address this limitation, we introduce a virtual-machine-based system called Overshadow that protects the privacy and integrity of application data, even in the event of a total OS compromise. Overshadow presents an application with a normal view of its resources, but the OS with an encrypted view. This allows the operating system to carry out the complex task of managing an application's resources, without allowing it to read or modify them. Thus, Overshadow offers a last line of defense for application data. |
[45] |
Dan R. K. Ports and Tal Garfinkel.
Towards application security on untrusted operating systems.
In Proceedings of the 3rd Workshop on Hot Topics in Security
(HotSec '08), San Jose, CA, USA, July 2008. USENIX.
[ bib |
slides (.pdf) |
.ps.gz |
.pdf ]
Complexity in commodity operating systems makes compromises inevitable. Consequently, a great deal of work has examined how to protect security-critical portions of applications from the OS through mechanisms such as microkernels, virtual machine monitors, and new processor architectures. Unfortunately, most work has focused on CPU and memory isolation and neglected OS semantics. Thus, while much is known about how to prevent OS and application processes from modifying each other, far less is understood about how different OS components can undermine application security if they turn malicious. |
[46] |
Austin T. Clements, Dan R. K. Ports, and David R. Karger.
Arpeggio: Metadata searching and content sharing with Chord.
In Proceedings of the 4th International Workshop on Peer-to-Peer
Systems (IPTPS '05), volume 3640 of Lecture Notes in Computer
Science, pages 58--68, Ithaca, NY, USA, February 2005. Springer.
[ bib |
slides (.pdf) |
.ps.gz |
.pdf ]
Arpeggio is a peer-to-peer file-sharing network based on the Chord lookup primitive. Queries for data whose metadata matches a certain criterion are performed efficiently by using a distributed keyword-set index, augmented with index-side filtering. We introduce index gateways, a technique for minimizing index maintenance overhead. Because file data is large, Arpeggio employs subrings to track live source peers without the cost of inserting the data itself into the network. Finally, we introduce postfetching, a technique that uses information in the index to improve the availability of rare files. The result is a system that provides efficient query operations with the scalability and reliability advantages of full decentralization, and a content distribution system tuned to the requirements and capabilities of a peer-to-peer network. |
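A minimal sketch of a keyword-set index: each file is indexed under every subset of its keywords up to size K, turning a multi-keyword query (of up to K terms) into a single lookup instead of a distributed intersection. (Illustrative; in Arpeggio each subset key maps to a Chord node.)

```python
from itertools import combinations

K = 3   # maximum keyword-subset size indexed

def index_keys(keywords):
    kws = sorted(set(keywords))
    for r in range(1, K + 1):
        for subset in combinations(kws, r):
            yield subset

index = {}

def insert(file_id, keywords):
    for key in index_keys(keywords):
        index.setdefault(key, set()).add(file_id)

def query(*keywords):
    # Any query of up to K keywords is one exact lookup.
    return index.get(tuple(sorted(set(keywords))), set())

insert("song.mp3", ["jazz", "piano", "live"])
print(query("jazz", "piano"))   # {'song.mp3'} -- one lookup, no intersection
```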
[47] | Dan R. K. Ports, Austin T. Clements, and Erik D. Demaine. PersiFS: A versioned file system with an efficient representation. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP '05), Brighton, United Kingdom, October 2005. ACM. Poster and extended abstract. [ bib ] |
[48] |
Austin T. Clements, Dan R. K. Ports, and David R. Karger.
Arpeggio: Efficient metadata-based searching and file transfer with
DHTs.
In Proceedings of the 2nd Project IRIS Student Workshop (ISW
'04), Cambridge, MA, USA, November 2004.
Poster and extended abstract.
[ bib ] |
[49] |
Dan R. K. Ports.
Arpeggio: Metadata indexing in a structured peer-to-peer network.
M.Eng. thesis, Massachusetts Institute of Technology, Cambridge, MA,
USA, February 2007.
[ bib |
.ps.gz |
.pdf ]
Peer-to-peer networks require an efficient means for performing searches for files by metadata keywords. Unfortunately, current methods usually sacrifice either scalability or recall. Arpeggio is a peer-to-peer file-sharing network that uses the Chord lookup primitive as a basis for constructing a distributed keyword-set index, augmented with index-side filtering, to address this problem. We introduce index gateways, a technique for minimizing index maintenance overhead. Arpeggio also includes a content distribution system for finding source peers for a file; we present a novel system that uses Chord subrings to track live source peers without the cost of inserting the data itself into the network, and supports postfetching: using information in the index to improve the availability of rare files. The result is a system that provides efficient query operations with the scalability and reliability advantages of full decentralization. We use analysis and simulation results to show that our indexing system has reasonable storage and bandwidth costs, and improves load distribution. |
[50] |
Adriana Szekeres, Irene Zhang, Katelin Bailey, Isaac Ackerman, Haichen Shen,
Franziska Roesner, Dan R. K. Ports, Arvind Krishnamurthy, and Henry M. Levy.
Making distributed mobile applications SAFE: Enforcing user privacy
policies on untrusted applications with secure application flow enforcement.
arXiv preprint 2008.06536, arXiv, August 2020.
[ bib |
.pdf ]
Today's mobile devices sense, collect, and store huge amounts of personal information, which users share with family and friends through a wide range of applications. Once users give applications access to their data, they must implicitly trust that the apps correctly maintain data privacy. As we know from both experience and all-too-frequent press articles, that trust is often misplaced. While users do not trust applications, they do trust their mobile devices and operating systems. Unfortunately, sharing applications are not limited to mobile clients but must also run on cloud services to share data between users. In this paper, we leverage the trust that users have in their mobile OSes to vet cloud services. To do so, we define a new Secure Application Flow Enforcement (SAFE) framework, which requires cloud services to attest to a system stack that will enforce policies provided by the mobile OS for user data. We implement a mobile OS that enforces SAFE policies on unmodified mobile apps and two systems for enforcing policies on untrusted cloud services. Using these prototypes, we demonstrate that it is possible to enforce existing user privacy policies on unmodified applications. |
[51] |
Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon
Kim, Arvind Krishnamurthy, Masoud Moshref, Dan R. K. Ports, and Peter
Richtarik.
Scaling distributed machine learning with in-network aggregation.
arXiv preprint 1903.06701, arXiv, February 2019.
version 2, Sep. 2020.
[ bib |
.pdf ]
Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide an efficient solution that speeds up training by up to 5.5× for a number of real-world benchmark models. |
[52] |
Adriana Szekeres, Michael Whittaker, Naveen Kr. Sharma, Jialin Li, Arvind
Krishnamurthy, Irene Zhang, and Dan R. K. Ports.
Meerkat: Scalable replicated transactions following the
zero-coordination principle.
Technical Report UW-CSE-2019-11-02, University of Washington CSE,
Seattle, WA, USA, November 2019.
[ bib |
.pdf ]
Traditionally, the high cost of network communication between servers has hidden the impact of cross-core coordination in replicated systems. However, new technologies, like kernel-bypass networking and faster network links, have exposed hidden bottlenecks in distributed systems. |
[53] |
Hang Zhu, Zhihao Bai, Jialin Li, Ellis Michael, Dan R. K. Ports, Ion Stoica,
and Xin Jin.
Harmonia: Near-linear scalability for replicated storage with
in-network conflict detection.
Proceedings of the VLDB Endowment, 13(3):376--389, November
2019.
[ bib |
.pdf ]
Distributed storage employs replication to mask failures and improve availability. However, these systems typically exhibit a hard tradeoff between consistency and performance. Ensuring consistency introduces coordination overhead, and as a result the system throughput does not scale with the number of replicas. We present Harmonia, a replicated storage architecture that exploits the capability of new-generation programmable switches to obviate this tradeoff by providing near-linear scalability without sacrificing consistency. To achieve this goal, Harmonia detects read-write conflicts in the network, which enables any replica to serve reads for objects with no pending writes. Harmonia implements this functionality at line rate, thus imposing no performance overhead. We have implemented a prototype of Harmonia on a cluster of commodity servers connected by a Barefoot Tofino switch, and have integrated it with Redis. We demonstrate the generality of our approach by supporting a variety of replication protocols, including primary-backup, chain replication, Viewstamped Replication, and NOPaxos. Experimental results show that Harmonia improves the throughput of these protocols by up to 10x for a replication factor of 10, providing near-linear scalability up to the limit of our testbed. |
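The conflict-detection idea in miniature: track the set of objects with writes in flight; reads to clean objects can go to any replica, while reads to dirty ones fall back to the normal protocol. (Illustrative sketch; Harmonia does this at line rate in the switch.)

```python
import random

class ConflictDetector:
    def __init__(self, replicas, primary):
        self.replicas, self.primary = replicas, primary
        self.dirty = set()                  # keys with pending writes

    def on_write_start(self, key):
        self.dirty.add(key)
        return self.primary                 # writes follow the normal protocol

    def on_write_committed(self, key):
        self.dirty.discard(key)

    def route_read(self, key):
        if key in self.dirty:
            return self.primary             # possible conflict: stay safe
        return random.choice(self.replicas) # no pending write: any replica

cd = ConflictDetector(["r0", "r1", "r2"], primary="r0")
cd.on_write_start("x")
assert cd.route_read("x") == "r0"           # conflicting read stays safe
cd.on_write_committed("x")
print(cd.route_read("x"))                   # now any replica may serve it
```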
[54] |
Hang Zhu, Zhihao Bai, Jialin Li, Ellis Michael, Dan R. K. Ports, Ion Stoica,
and Xin Jin.
Harmonia: Near-linear scalability for replicated storage with
in-network conflict detection.
arXiv preprint 1904.08964, arXiv, April 2019.
[ bib |
.pdf ]
Distributed storage employs replication to mask failures and improve availability. However, these systems typically exhibit a hard tradeoff between consistency and performance. Ensuring consistency introduces coordination overhead, and as a result the system throughput does not scale with the number of replicas. We present Harmonia, a replicated storage architecture that exploits the capability of new-generation programmable switches to obviate this tradeoff by providing near-linear scalability without sacrificing consistency. To achieve this goal, Harmonia detects read-write conflicts in the network, which enables any replica to serve reads for objects with no pending writes. Harmonia implements this functionality at line rate, thus imposing no performance overhead. We have implemented a prototype of Harmonia on a cluster of commodity servers connected by a Barefoot Tofino switch, and have integrated it with Redis. We demonstrate the generality of our approach by supporting a variety of replication protocols, including primary-backup, chain replication, Viewstamped Replication, and NOPaxos. Experimental results show that Harmonia improves the throughput of these protocols by up to 10X for a replication factor of 10, providing near-linear scalability up to the limit of our testbed. |
[55] |
Jialin Li, Jacob Nelson, Xin Jin, and Dan R. K. Ports.
Pegasus: Load-aware selective replication with an in-network
coherence directory.
Technical Report UW-CSE-18-12-01, University of Washington CSE,
Seattle, WA, USA, December 2018.
[ bib |
.pdf ]
High performance distributed storage systems face the challenge of load imbalance caused by skewed and dynamic workloads. This paper introduces Pegasus, a new storage architecture that leverages new-generation programmable switch ASICs to balance load across storage servers. Pegasus uses selective replication of the most popular objects in the data store to distribute load. Using a novel in-network coherence directory, the Pegasus switch tracks and manages the location of replicated objects. This allows it to achieve load-aware forwarding and dynamic rebalancing for replicated keys, while still guaranteeing data coherence. The Pegasus design is practical to implement as it stores only forwarding metadata in the switch data plane. The resulting system improves the 99% tail latency of a distributed in-memory key-value store by more than 95%, and yields up to a 9x throughput improvement under a latency SLO -- results which hold across a large set of workloads with varying degrees of skewness, read/write ratio, and dynamism. |
[56] |
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan
R. K. Ports.
Building consistent transactions with inconsistent replication.
ACM Transactions on Computer Systems, 35(4):12, December
2018.
[ bib |
.pdf ]
Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google's Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use -- in part because they require costly replication protocols, like Paxos, for fault tolerance. In this paper, we present a new approach that makes transactional storage systems more affordable: we eliminate consistency from the replication protocol while still providing distributed transactions with strong consistency to applications. |
[57] |
Jialin Li, Ellis Michael, and Dan R. K. Ports.
Eris: Coordination-free consistent transactions using network
multi-sequencing (extended version).
Technical Report UW-CSE-TR-17-10-01, University of Washington CSE,
Seattle, WA, USA, October 2017.
[ bib |
.pdf ]
Distributed storage systems aim to provide strong consistency and isolation guarantees on an architecture that is partitioned across multiple shards for scalability and replicated for fault-tolerance. Traditionally, achieving all of these goals has required an expensive combination of atomic commitment and replication protocols -- introducing extensive coordination overhead. Our system, Eris, takes a very different approach. It moves a core piece of concurrency control functionality, which we term multi-sequencing, into the datacenter network itself. This network primitive takes on the responsibility for consistently ordering transactions, and a new lightweight transaction protocol ensures atomicity. The end result is that Eris avoids both replication and transaction coordination overhead: we show that it can process a large class of distributed transactions in a single round-trip from the client to the storage system without any explicit coordination between shards or replicas. It provides atomicity, consistency, and fault-tolerance with less than 10% overhead -- achieving throughput 4.5--35x higher and latency 72--80% lower than a conventional design on standard benchmarks. |
[58] |
Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres.
Recovering shared objects without stable storage (extended version).
Technical Report UW-CSE-17-08-01, University of Washington CSE,
Seattle, WA, USA, August 2017.
[ bib |
.pdf ]
This paper considers the problem of building fault-tolerant shared objects when processes can crash and recover but lose their persistent state on recovery. This Diskless Crash-Recovery (DCR) model matches the way many long-lived systems are built. We show that it presents new challenges, as operations that are recorded at a quorum may not persist after some of the processes in that quorum crash and then recover. |
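A concrete toy instance of the hazard (illustrative code, not the paper's protocol): a write recorded at a majority of three processes can be missed by a later majority read once one acknowledging process recovers with empty state.

```python
# Toy instance of the Diskless Crash-Recovery hazard described above.

nodes = {"p0": set(), "p1": set(), "p2": set()}

def write(value, quorum):
    for n in quorum:
        nodes[n].add(value)                 # "recorded at a quorum"

def read(quorum):
    return set().union(*(nodes[n] for n in quorum))

write("x=1", ["p0", "p1"])                  # majority {p0, p1} acknowledges
nodes["p1"] = set()                         # p1 crashes, recovers with no state
assert "x=1" not in read(["p1", "p2"])      # a later majority misses the write
```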
[59] |
Brandon Holt, James Bornholt, Irene Zhang, Dan R. K. Ports, Mark Oskin, and
Luis Ceze.
Disciplined inconsistency.
Technical Report UW-CSE-TR-16-06-01, University of Washington CSE,
Seattle, WA, USA, June 2016.
[ bib |
.pdf ]
Distributed applications and web services, such as online stores or social networks, are expected to be scalable, available, responsive, and fault-tolerant. To meet these steep requirements in the face of high round-trip latencies, network partitions, server failures, and load spikes, applications use eventually consistent datastores that allow them to weaken the consistency of some data. However, making this transition is highly error-prone because relaxed consistency models are notoriously difficult to understand and test. |
[60] |
Jialin Li, Ellis Michael, Adriana Szekeres, Naveen Kr. Sharma, and Dan R. K.
Ports.
Just say NO to Paxos overhead: Replacing consensus with network
ordering (extended version).
Technical Report UW-CSE-TR-16-09-02, University of Washington CSE,
Seattle, WA, USA, 2016.
[ bib |
.pdf ]
Distributed applications use replication, implemented by protocols like Paxos, to ensure data availability and transparently mask server failures. This paper presents a new approach to achieving replication in the data center without the performance cost of traditional methods. Our work carefully divides replication responsibility between the network and protocol layers. The network orders requests but does not ensure reliable delivery -- using a new primitive we call ordered unreliable multicast (OUM). This primitive can be implemented at near-zero cost in the data center. Our new replication protocol, Network-Ordered Paxos (NOPaxos), exploits network ordering to provide strongly consistent replication without coordination. The resulting system not only outperforms both latency- and throughput-optimized protocols on their respective metrics, but also yields throughput within 2% and latency within 16 μs of an unreplicated system -- providing replication without the performance cost. |
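The OUM contract admits a short sketch (hypothetical names; a real deployment implements the sequencer in a switch or middlebox): every multicast carries a global sequence number but may be dropped, so replicas agree on order for free and only coordinate when a gap reveals a drop.

```python
# Hedged sketch of ordered unreliable multicast and its fast path.

class Sequencer:
    def __init__(self):
        self.seq = 0
    def stamp(self, msg):
        self.seq += 1
        return (self.seq, msg)              # ordering, but no delivery guarantee

class Replica:
    def __init__(self):
        self.log = []
        self.next_seq = 1
    def deliver(self, stamped):
        seq, msg = stamped
        if seq != self.next_seq:
            return "gap-agreement"          # rare slow path: resolve the drop
        self.log.append(msg)
        self.next_seq += 1
        return "fast-path"                  # common case: no coordination

s, r = Sequencer(), Replica()
m1, m2, m3 = s.stamp("a"), s.stamp("b"), s.stamp("c")
assert r.deliver(m1) == "fast-path"
assert r.deliver(m3) == "gap-agreement"     # m2 was dropped by the network
```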
[61] |
Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres.
Providing stable storage for the diskless crash-recovery failure
model.
Technical Report UW-CSE-TR-16-08-02, University of Washington CSE,
Seattle, WA, USA, August 2016.
[ bib |
.pdf ]
Many classic protocols in the fault-tolerant distributed computing literature assume a Crash-Fail model in which processes are either up, or have crashed and are permanently down. While this model is useful, it does not fully capture the difficulties many real systems must contend with. In particular, real-world systems are long-lived and must have a recovery mechanism so that crashed processes can rejoin the system and restore its fault-tolerance. When processes are assumed to have access to stable storage that is persistent across failures, the Crash-Recovery model is trivial. However, because disk failures are common and because having a disk on a protocol's critical path is often a performance concern, diskless recovery protocols are needed. While such protocols do exist in the state machine replication literature, several well-known protocols have flawed recovery mechanisms. We examine these errors to elucidate the problem of diskless recovery and present our own protocol for providing virtual stable storage, transforming any protocol in the Crash-Recovery with stable storage model into a protocol in the Diskless Crash-Recovery model. |
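The virtual stable storage transformation can be approximated in a few lines. In this simplified sketch (illustrative only; the actual protocol must also handle concurrent writers and recoveries), a "disk write" becomes a write to a majority of peers, and recovery reads from a majority, relying on the fact that any two majorities intersect:

```python
# Simplified sketch of virtual stable storage via quorum intersection.

class Peer:
    def __init__(self):
        self.slots = {}                     # key -> (epoch, value)

def persist(peers, key, value, epoch):
    majority = len(peers) // 2 + 1
    for p in peers[:majority]:              # stand-in for "any majority responds"
        p.slots[key] = (epoch, value)

def recover(peers, key):
    quorum = peers[len(peers) // 2:]        # any two majorities intersect
    found = [p.slots[key] for p in quorum if key in p.slots]
    return max(found)[1] if found else None # highest epoch wins

peers = [Peer() for _ in range(5)]
persist(peers, "view-number", 7, epoch=1)
assert recover(peers, "view-number") == 7
```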
[62] |
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan
R. K. Ports.
When is operation ordering required in replicated transactional
storage?
IEEE Data Engineering Bulletin, 39(1):27--38, March 2016.
[ bib |
.pdf ]
Today's replicated transactional storage systems typically have a layered architecture, combining protocols for transaction coordination, consistent replication, and concurrency control. These systems generally require costly strongly-consistent replication protocols like Paxos, which assign a total order to all operations. To avoid this cost, we ask whether all replicated operations in these systems need to be strictly ordered. Recent research has yielded replication protocols that can avoid unnecessary ordering, e.g., by exploiting commutative operations, but it is not clear how to apply these to replicated transaction processing systems. We answer this question by analyzing existing transaction processing designs in terms of which replicated operations require ordering and which simply require fault tolerance. We describe how this analysis leads to our recent work on TAPIR, a transaction protocol that efficiently provides strict serializability by using a new replication protocol that provides fault tolerance but not ordering for most operations. |
[63] |
Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind
Krishnamurthy, Thomas Anderson, and Timothy Roscoe.
Arrakis: The operating system is the control plane.
ACM Transactions on Computer Systems, 33(4), November 2015.
[ bib |
.pdf ]
Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing 2-5x improvements in latency and a 9x improvement in throughput for a popular persistent NoSQL store relative to a well-tuned Linux implementation. |
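The control-plane/data-plane split is easiest to see as an interface. The simulation below is not Arrakis's actual API (all names are hypothetical): the kernel acts only at setup time to create a protected virtualized device, and per-packet operations then proceed with no kernel involvement, protection being enforced by the device itself.

```python
# Conceptual simulation of a kernel-as-control-plane design.

class Kernel:
    """Control plane: naming, protection, and resource allocation only."""
    def create_virtual_nic(self, app, allowed_ports):
        return VirtualNIC(owner=app, allowed_ports=set(allowed_ports))

class VirtualNIC:
    """Data plane: the application drives hardware queues directly."""
    def __init__(self, owner, allowed_ports):
        self.owner = owner
        self.allowed_ports = allowed_ports  # enforced by the device, not the kernel
        self.tx_queue = []

    def send(self, port, payload):
        if port not in self.allowed_ports:  # hardware-enforced filter
            raise PermissionError(port)
        self.tx_queue.append((port, payload))  # no system call on this path

kernel = Kernel()
nic = kernel.create_virtual_nic("webserver", allowed_ports=[80, 443])
nic.send(443, b"hello")                     # fast path: kernel not involved
```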
[64] |
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan
R. K. Ports.
Building consistent transactions with inconsistent replication
(extended version).
Technical Report UW-CSE-2014-12-01 v2, University of Washington CSE,
October 2015.
[ bib |
.pdf ]
Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google's Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use -- in part because they require costly replication protocols, like Paxos, for fault tolerance. In this paper, we present a new approach that makes transactional storage systems more affordable: we eliminate consistency from the replication protocol while still providing distributed transactions with strong consistency to applications. |
[65] |
Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble.
Tales of the tail: Hardware, OS, and application-level sources of
tail latency.
Technical Report UW-CSE-14-04-01, University of Washington CSE,
Seattle, WA, USA, April 2014.
[ bib |
.pdf ]
Interactive services often have large-scale parallel implementations. To deliver fast responses, the median and tail latencies of a service's components must be low. In this paper, we explore the hardware, OS, and application-level sources of poor tail latency in high throughput servers executing on multi-core machines. |
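The kind of measurement the study builds on is easy to reproduce. The snippet below uses a synthetic workload (the paper instruments real hardware, OS, and application layers) to show how a small fraction of slow requests barely moves the median yet sets the 99.9th percentile:

```python
# Synthetic demonstration of tail latency: ~1% rare stalls dominate p99.9.
import random
import statistics

random.seed(0)
samples = [random.expovariate(1 / 100) for _ in range(100_000)]  # ~100 us typical
samples += [random.uniform(5_000, 10_000) for _ in range(1_000)] # rare stalls

def percentile(data, p):
    data = sorted(data)
    return data[int(p / 100 * (len(data) - 1))]

print(f"median: {statistics.median(samples):8.1f} us")
print(f"p99   : {percentile(samples, 99):8.1f} us")
print(f"p99.9 : {percentile(samples, 99.9):8.1f} us")  # set by the stalls
```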
[66] |
Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Arvind Krishnamurthy,
Thomas Anderson, and Timothy Roscoe.
Arrakis: The operating system is the control plane.
Technical Report UW-CSE-13-10-01, version 2.0, University of
Washington CSE, Seattle, WA, USA, May 2014.
[ bib |
.pdf ]
Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications, to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing 2-5x end-to-end latency and 9x throughput improvements for a popular persistent NoSQL store relative to a well-tuned Linux implementation. |
[67] |
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan
R. K. Ports.
Building consistent transactions with inconsistent replication.
Technical Report UW-CSE-2014-12-01, University of Washington CSE,
December 2014.
[ bib |
.pdf ] |
[68] |
Peter Hornyack, Luis Ceze, Steven D. Gribble, Dan R. K. Ports, and Henry M.
Levy.
A study of virtual memory usage and implications for large memory.
Technical report, University of Washington CSE, Seattle, WA, 2013.
[ bib |
.pdf ]
The mechanisms now used to implement virtual memory -- pages, page tables, and TLBs -- have worked remarkably well for over fifty years. However, they are beginning to show their age due to current trends, such as significant increases in physical memory size, emerging data-intensive applications, and imminent non-volatile main memory. These trends call into question whether page-based address-translation and protection mechanisms remain viable solutions in the future. In this paper, we present a detailed study of how modern applications use virtual memory. Among other topics, our study examines the footprint of mapped regions, the use of memory protection, and the overhead of TLBs. Our results suggest that a segment-based translation mechanism, together with a fine-grained protection mechanism, merits consideration for future systems. |
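A segment-based translation of the sort the study points toward can be sketched directly (hypothetical layout and API): a single base/limit pair translates an entire mapped region, where page-based translation would walk a table for every 4 KB page.

```python
# Illustrative sketch of segment-based address translation with protection.

class Segment:
    def __init__(self, virt_base, phys_base, length, writable):
        self.virt_base, self.phys_base = virt_base, phys_base
        self.length, self.writable = length, writable

def translate(segments, vaddr, write=False):
    for s in segments:                      # real hardware would use a few registers
        offset = vaddr - s.virt_base
        if 0 <= offset < s.length:
            if write and not s.writable:
                raise PermissionError(hex(vaddr))
            return s.phys_base + offset
    raise LookupError(f"fault at {hex(vaddr)}")

heap = Segment(virt_base=0x10000, phys_base=0x80000, length=0x4000, writable=True)
assert translate([heap], 0x10010) == 0x80010
```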
[69] |
Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Arvind Krishnamurthy,
Thomas Anderson, and Timothy Roscoe.
Arrakis: The operating system is the control plane.
Technical Report UW-CSE-13-10-01, University of Washington CSE,
Seattle, WA, USA, October 2013.
[ bib |
.pdf ]
Recent device hardware trends enable a new approach to the design of network servers. In a traditional operating system, the kernel mediates access to device hardware by server applications, to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely. The Arrakis kernel operates only in the control plane. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing significant latency and throughput improvements for network server applications relative to a well-tuned Linux implementation. |
[70] |
Dan R. K. Ports.
Application-Level Caching with Transactional Consistency.
PhD thesis, Massachusetts Institute of Technology, Cambridge, MA,
USA, June 2012.
[ bib |
.pdf ]
Distributed in-memory application data caches like memcached are a popular solution for scaling database-driven web sites. These systems increase performance significantly by reducing load on both the database and application servers. Unfortunately, such caches present two challenges for application developers. First, they cannot ensure that the application sees a consistent view of the data within a transaction, violating the isolation properties of the underlying database. Second, they leave the application responsible for locating data in the cache and keeping it up to date, a frequent source of application complexity and programming errors. |
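The first challenge admits a compact illustration. In this toy example (hypothetical schema, not the thesis's system), a look-aside cache mixes a stale cached object with a fresh database read, so a read-only transaction observes a state that never existed under the database's isolation guarantees:

```python
# Toy illustration of the transactional-consistency problem with
# look-aside caches such as memcached.

database = {"balance_a": 50, "balance_b": 50}   # invariant: a + b == 100
cache = {"balance_a": 50}                        # cached earlier

# A transfer updates the database, but the cache is not (yet) invalidated.
database["balance_a"], database["balance_b"] = 0, 100

# A read-only "transaction" mixes a cache hit with a database read:
a = cache.get("balance_a", database["balance_a"])   # stale: 50
b = database["balance_b"]                            # fresh: 100
assert a + b != 100     # invariant appears violated -- no single snapshot
```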
[71] |
Dan R. K. Ports and Kevin Grittner.
Serializable snapshot isolation in PostgreSQL.
Proceedings of the VLDB Endowment, 5(12):1850--1861, August
2012.
[ bib |
slides (.pdf) |
.pdf ]
This paper describes our experience implementing PostgreSQL's new serializable isolation level. It is based on the recently-developed Serializable Snapshot Isolation (SSI) technique. This is the first implementation of SSI in a production database release as well as the first in a database that did not previously have a lock-based serializable isolation level. We reflect on our experience and describe how we overcame some of the resulting challenges, including the implementation of a new lock manager, a technique for ensuring memory usage is bounded, and integration with other PostgreSQL features. We also introduce an extension to SSI that improves performance for read-only transactions. We evaluate PostgreSQL's serializable isolation level using several benchmarks and show that it achieves performance only slightly below that of snapshot isolation, and significantly outperforms the traditional two-phase locking approach on read-intensive workloads. |
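From an application's perspective, the feature is used by running transactions at the SERIALIZABLE level and retrying on serialization failures, which SSI may report for transactions that would violate serializability. A minimal usage sketch, assuming psycopg2 >= 2.8 and a placeholder accounts table and connection string:

```python
# Retry loop for PostgreSQL serializable transactions (SQLSTATE 40001).
import psycopg2
from psycopg2 import errors

conn = psycopg2.connect("dbname=test")      # connection details are placeholders
conn.set_session(isolation_level="SERIALIZABLE")

def transfer(amount):
    while True:                             # SSI aborts are expected; just retry
        try:
            with conn, conn.cursor() as cur:  # `with conn` commits or rolls back
                cur.execute("UPDATE accounts SET balance = balance - %s "
                            "WHERE name = 'a'", (amount,))
                cur.execute("UPDATE accounts SET balance = balance + %s "
                            "WHERE name = 'b'", (amount,))
            return
        except errors.SerializationFailure:
            continue
```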
This file was generated by bibtex2html 1.99.