ARock: an Algorithmic Framework for Asynchronous Parallel Coordinate Updates
Abstract
Finding a fixed point to a nonexpansive operator, i.e., , abstracts many problems in numerical linear algebra, optimization, and other areas of scientific computing. To solve fixedpoint problems, we propose ARock, an algorithmic framework in which multiple agents (machines, processors, or cores) update in an asynchronous parallel fashion. Asynchrony is crucial to parallel computing since it reduces synchronization wait, relaxes communication bottleneck, and thus speeds up computing significantly. At each step of ARock, an agent updates a randomly selected coordinate based on possibly outofdate information on . The agents share through either global memory or communication. If writing is atomic, the agents can read and write without memory locks.
Theoretically, we show that if the nonexpansive operator has a fixed point, then with probability one, ARock generates a sequence that converges to a fixed points of . Our conditions on and step sizes are weaker than comparable work. Linear convergence is also obtained.
We propose special cases of ARock for linear systems, convex optimization, machine learning, as well as distributed and decentralized consensus problems. Numerical experiments of solving sparse logistic regression problems are presented.
∎
Contents
 1 Introduction
 2 Applications
 3 Convergence
 4 Experiments
 5 Conclusion
 6 Acknowledgements
 A Derivation of certain updates
 B Derivation of asyncparallel ADMM for decentralized optimization
1 Introduction
Technological advances in data gathering and storage have led to a rapid proliferation of big data in diverse areas such as climate studies, cosmology, medicine, the Internet, and engineering house2014big . The data involved in many of these modern applications are large and grow quickly. Therefore, parallel computational approaches are needed. This paper introduces a new approach to asynchronous parallel computing with convergence guarantees.
In a synchronous(sync) parallel iterative algorithm, the agents must wait for the slowest agent to finish an iteration before they can all proceed to the next one (Figure 0(a)). Hence, the slowest agent may cripple the system. In contract, the agents in an asynchronous(async) parallel iterative algorithm run continuously with little idling (Figure 0(b)). However, the iterations are disordered, and an agent may carry out an iteration without the newest information from other agents.
Asynchrony has other advantages bertsekas1991some : the system is more tolerant to computing faults and communication glitches; it is also easy to incorporate new agents.
On the other hand, it is more difficult to analyze asynchronous algorithms and ensure their convergence. It becomes impossible to find a sequence of iterates that one completely determines the next. Nonetheless, we let any update be a new iteration and propose an asyncparallel algorithm (ARock) for the generic fixedpoint iteration. It converges if the fixedpoint operator is nonexpansive (Def. 1) and has a fixed point.
Let be Hilbert spaces and be their Cartesian product. For a nonexpansive operator , our problem is to
(1) 
Finding a fixed point to is equivalent to finding a zero of denoted by such that . Hereafter, we will use both and for convenience.
Problem (1) is widely applicable in linear and nonlinear equations, statistical regression, machine learning, convex optimization, and optimal control. A generic framework for problem (1) is the Krasnosel’skiĭ–Mann (KM) iteration krasnosel1955two :
(2) 
where is the step size. If — the set of fixed points of (zeros of ) — is nonempty, then the sequence converges weakly to a point in and converges strongly to 0. The KM iteration generalizes algorithms in convex optimization, linear algebra, differential equations, and monotone inclusions. Its special cases include the following iterations: alternating projection, gradient descent, projected gradient descent, proximalpoint algorithm, ForwardBackward Splitting (FBS) passty1979ergodic , DouglasRachford Splitting (DRS) lions1979splitting , a threeoperator splitting davis2015three , and the Alternating Direction Method of Multipliers (ADMM) lions1979splitting ; glowinski1975approximation .
In ARock, a set of agents, , solve problem (1) by updating the coordinates , , in a random and asynchronous fashion. Algorithm 1 describes the framework. Its special forms for several applications are given in Section 2 below.
Whenever an agent updates a coordinate, the global iteration counter increases by one. The th update is applied to , where is an independent random variable. Each coordinate update has the form:
(3) 
where is a scalar whose range will be set later, and is used to normalize nonuniform selection probabilities. In the uniform case, namely, for all , we have , which simplifies the update (3) to
(4) 
Here, the point is what an agent reads from global memory to its local cache and to which is applied, and denotes the state of in global memory just before the update (3) is applied. In a syncparallel algorithm, we have , but in ARock, due to possible updates to by other agents, can be different from . This is a key difference between syncparallel and asyncparallel algorithms. In Subsection 1.2 below, we will establish the relationship between and as
(5) 
where and is the maximum number of other updates to during the computation of (3). Equation (5) has appeared in liu2014asynchronous .
The update (3) is only computationally worthy if is much cheaper to compute than . Otherwise, it is more preferable to apply the full KM update (2). In Section 2, we will present several applications that have the favorable structures for ARock. The recent work peng2016coordinate studies coordinate friendly structures more thoroughly.
The convergence of ARock (Algorithm 1) is stated in Theorems 3.2 and 3.3. Here we include a shortened version, leaving detailed bounds to the full theorems:
Theorem 1.1 (Global and linear convergence)
Let be a nonexpansive operator that has a fixed point. Let be the sequence generated by Algorithm 1 with properly bounded step sizes . Then, with probability one, converges weakly to a fixed point of . This convergence becomes strong if has a finite dimension.
In addition, if is demicompact (see Definition 2 below), then with probability one, converges strongly to a fixed point of .
Furthermore, if is quasistrongly monotone (see Definition 1 below), then has a unique fixedpoint , converges strongly to with probability one, and converges to 0 at a linear rate.
In the theorem, the weak convergence result only requires to be nonexpansive and has a fixed point. In addition, the computation requires: (a) bounded step sizes; (b) random coordinate selection; and (c) a finite maximal delay . Assumption (a) is standard, and we will see the bound can be . Assumption (b) is essential to both the analysis and the numerical performance of our algorithms. Assumption (c) is not essential; an infinite delay with a light tail is allowed (but we leave it to future work). The strong convergence result applies to all the examples in Section 2, and the linear convergence result applies to Examples 2.2 and 2.4 when the corresponding operator is quasistrongly monotone. Step sizes are discussed in Remarks 2 and 4.
1.1 On random coordinate selection
ARock employs random coordinate selection. This subsection discusses its advantages and disadvantages.
Its main disadvantage is that an agent cannot caching the data associated with a coordinate. The variable and its related data must be either stored in global memory or passed through communication. A secondary disadvantage is that pseudorandom number generation takes time, which becomes relatively significant if each coordinate update is cheap. (The network optimization examples in Subsections 2.3 and 2.6.2 are exceptions, where data are naturally stored in a distributed fashion and random coordinate assignments are the results of Poisson processes.)
There are several advantages of random coordinate selection. It realizes the userspecified update frequency for every component , , even when different agents have different computing powers and different coordinate updates cost different amounts of computation. Therefore, random assignment ensures load balance. The algorithm is also fault tolerant in the sense that if one or more agents fail, it will still converge to a fixedpoint of . In addition, it has been observed numerically on certain problems chang2008coordinate that random coordinate selection accelerates convergence.
1.2 Uncoordinated memory access
In ARock, since multiple agents simultaneously read and update in global memory, — the result of that is read from global memory by an agent to its local cache for computation — may not equal for any , that is, may never be consistent with a state of in global memory. This is known as inconsistent read. In contrast, consistent read means that for some , i.e., is consistent with a state of that existed in global memory.
We illustrate inconsistent read and consistent read in the following example, which is depicted in Figure 2. Consider and initially, at time . Suppose at time , agent 2 updates from 0 to 1, yielding ; then, at time , agent 3 updates from 0 to 2, further yielding . Suppose that agent 1 starts reading from the first component at . For consistent read (Figure 1(a)), agent 1 acquires a memory lock and only releases the lock after finishing reading all of , , , and . Therefore, agent 1 will read in . Inconsistent read, however, allows agent 1 to proceed without a memory lock: agent 1 starts reading at (Figure 1(b)) and reaches the last component, , after ; since is updated by agent 3 prior to it is read by agent 1, agent 1 has read , which is different from any of , and .
Even with inconsistent read, each component is consistent under the atomic coordinate update assumption, which will be defined below. Therefore, we can express what has been read in terms of the changes of individual coordinates. In the above example, the first change is , which is added to just before time by agent 2, and the second change is , added to just before time by agent 3. The inconsistent read by agent 1, which gives the result , equals .
We have demonstrated that can be inconsistent, but each of its coordinates is consistent, that is, for each , is an everexisted state of among . Suppose that , where . Therefore, can be related to through the interim changes applied to . Let be the index set of these interim changes. If , then ; otherwise, . In addition, we have . Since the global counter is increased after each coordinate update, updates to and , , must occur at different ’s and thus . Therefore, by letting and noticing for where , we have , which is equivalent to (5). Here, we have made two assumptions:

atomic coordinate update: a coordinate is not further broken to smaller components during an update; they are all updated at once.

bounded maximal delay : during any update cycle of an agent, in global memory is updated at most times by other agents.
When each coordinate is a single scalar, updating the scalar is a single atomic instruction on most modern hardware, so the first assumption naturally holds, and our algorithm is lockfree. The case where a coordinate is a block that includes multiple scalars is discussed in the next subsection.
1.2.1 Block coordinate
In the “block coordinate” case (updating a block of several coordinates each time), the atomic coordinate update assumption can be met by either employing a percoordinate memory lock or taking the following dualmemory approach: Store two copies of each coordinate in global memory, denoting them as and ; let a bit point to the active copy; an agent will only read from the active copy ; before an agent updates the components of , it obtains a memory lock to the inactive copy to prevent other agents from simultaneously updating it; then after it finishes updating , flip the bit so that other agents will begin reading from the updated copy. This approach never blocks any read of , yet it eliminates inconsistency.
1.3 Straightforward generalization
Our asyncparallel coordinate update scheme (3) can be generalized to (overlapping) block coordinate updates after a change to the step size. Specifically, the scheme (3) can be generalized to
(6) 
where is randomly drawn from a set of operators (), , following the probability , (, and ). The operators must satisfy and for some .
Let , which has ; then (6) reduces to (3). If is endowed with a metric such that (e.g., the metric in the CondatVũ primaldual splitting condat2013primal ; vu2013splitting ), then we have
In general, multiple coordinates can be updated in (6). Consider linear , , where for each . Then, for , we have
1.4 Special cases
If there is only one agent (), ARock (Algorithm 1) reduces to randomized coordinate update, which includes the special case of randomized coordinate descent nesterov2012rcd for convex optimization. Syncparallel coordinate update is another special case of ARock corresponding to . In both cases, there is no delay, i.e., and . In addition, the step size can be more relaxed. In particular, if , , then we can let , , for any , or when is averaged (see Definition 2 for the definition of an averaged operator).
1.5 Related work
Chazan and Miranker chazan1969chaotic proposed the first asyncparallel method in 1969. The method was designed for solving linear systems. Later, asyncparallel methods have been successful applied in many fields, e.g., linear systems avron2014revisiting ; bethune2014performance ; FSS1997asynaddSch ; rosenfeld1969case , nonlinear problems BMR1997asynmultisplit ; baudet1978asynchronous , differential equations aharoni2000parallel ; AAI1998implicit ; Chau20081126 ; donzis2014asynchronous , consensus problems LMS1986asynchronous ; leifang_information_2005 , and optimization hsieh2015passcode ; liu2014asynchronous ; liu2013asynchronous ; tai2002convergence ; zhang2014asynchronous . We review the theory for asyncparallel fixedpoint iteration and its applications.
General fixed point problems. Totally asyncparallel^{1}^{1}1“Totally asynchronous” means no upper bound on the delays; however, other conditions are required, for example: each coordinate must be updated infinitely many times. By default, “asynchronous” in this paper assumes a finite maximum delay. iterative methods for a fixedpoint problem go back as early as to Baudet baudet1978asynchronous , where the operator was assumed to be Pcontraction.^{2}^{2}2An operator is Pcontraction if , componentwise, where denotes the vector with components , and is a nonnegative matrix with a spectral radius strictly less than 1. Later, Bertsekas bertsekas1983distributed generalized the Pcontraction assumption and showed convergence. Frommer and Szyld frommer2000asynchronous reviewed the theory and applications of totally asyncparallel iterations prior to 2000. This review summarized convergence results under the conditions in bertsekas1983distributed . However, ARock can be applied to solve many more problems since our nonexpansive assumption, though not strictly weaker than Pcontraction, is more pervasive. As opposed to totally asynchronous methods, Tseng, Bertsekas, and Tsitsiklis bertsekas1989parallel ; TB1990partially assumed quasinonexpansiveness^{3}^{3}3An operator is quasinonexpansive if , , . and proposed an asyncparallel method, converging under an additional assumption, which is difficult to justify in general but can be established for problems such as linear systems and strictly convex network flow problems bertsekas1989parallel ; TB1990partially .
The above works assign coordinates in a deterministic manner. Different from them, ARock is stochastic, works for nonexpansive operators, and is more applicable.
Linear, nonlinear, and differential equations. The first asyncparallel method for solving linear equations was introduced by Chazan and Miranker in chazan1969chaotic . They proved that on solving linear systems, Pcontraction was necessary and sufficient for convergence. The performance of the algorithm was studied by Iain et al. bethune2014performance ; rosenfeld1969case on different High Performance Computing (HPC) architectures. Recently, Avron et al. avron2014revisiting revisited the asyncparallel coordinate update and showed its linear convergence for solving positivedefinite linear systems. Tarazi and Nabih el1982some extended the poineering work chazan1969chaotic to solving nonlinear equations, and the asyncparallel methods have also been applied for solving differential equations, e.g., in aharoni2000parallel ; AAI1998implicit ; Chau20081126 ; donzis2014asynchronous . Except for avron2014revisiting , all these methods are totally asyncparallel with the Pcontraction condition or its variants. On solving a positivedefinite linear system, avron2014revisiting made assumptions similar to ours, and it obtained better linear convergence rate on that special problem.
Optimization. The first asyncparallel coordinate update gradientprojection method was due to Bertsekas and Tsitsiklis bertsekas1989parallel . The method solves constrained optimization problems with a smooth objective and simple constraints. It was shown that the objective gradient sequence converges to zero. Tseng tseng1991rateasyn further analyzed the convergence rate and obtained local linear convergence based on the assumptions of isocost surface separation and a local Lipschitz error bound. Recently, Liu et al. liu2013asynchronous developed an asyncparallel stochastic coordinate descent algorithm for minimizing convex smooth functions. Later, Liu and Wright liu2014asynchronous suggested an asyncparallel stochastic proximal coordinate descent algorithm for minimizing convex composite objective functions. They established the convergence of the expected objectiveerror sequence for convex functions. Hsieh et al. hsieh2015passcode proposed an asyncparallel dual coordinate descent method for solving regularized empirical risk minimization problems. Other asyncparallel approaches include asynchronous ADMM hong2014distributed ; wei2013on ; zhang2014asynchronous ; iutzeler2013asynchronous . Among them, wei2013on ; iutzeler2013asynchronous use an asynchronous clock, and hong2014distributed ; zhang2014asynchronous use a central node to update the dual variable; they do not deal with delay or inconsistency. Asyncparallel stochastic gradient descent methods have also been considered in nedic2001distributed ; recht2011hogwild .
Our framework differs from the recent surge of the aforementioned syncparallel and asyncparallel coordinate descent algorithms (e.g., peng2013parallel ; kyrola2011parallel ; liu2013asynchronous ; liu2014asynchronous ; hsieh2015passcode ; richtarik2015parallel ). While they apply to convex function minimization, ARock covers more cases (such as ADMM, primaldual, and decentralized methods) and also provides sequence convergence. In Section 2, we will show that some of the existing asyncparallel coordinate descent algorithms are special cases of ARock, through relating their optimality conditions to nonexpansive operators. Another difference is that the convergence of ARock only requires a nonexpansive operator with a fixed point, whereas properties such as strong convexity, bounded feasible set, and bounded sequence, which are seen in some of the recent literature for asyncparallel convex minimization, are unnecessary.
Others. Besides solving equations and optimization problems, there are also applications of asyncparallel algorithms to optimal control problems LMS1986asynchronous , network flow problems ESMG1996asynflex , and consensus problems of multiagent systems leifang_information_2005 .
1.6 Contributions
Our contributions and techniques are summarized below:

ARock is the first asyncparallel coordinate update framework for finding a fixed point to a nonexpansive operator.

By introducing a new metric and establishing stochastic Fejér monotonicity, we show that, with probability one, ARock converges to a point in the solution set; linear convergence is obtained for quasistrongly monotone operators.

Based on ARock, we introduce an asyncparallel algorithm for linear systems, asyncparallel ADMM algorithms for distributed or decentralized computing problems, as well as asyncparallel operatorsplitting algorithms for nonsmooth minimization problems. Some problems are treated in they asyncparallel fashion for the first time in history. The developed algorithms are not straightforward modifications to their serial versions because their underlying nonexpansive operators must be identified before applying ARock.
1.7 Notation, definitions, background of monotone operators
Throughout this paper, denotes a separable Hilbert space equipped with the inner product and norm , and denotes the underlying probability space, where , , and are the sample space, algebra, and probability measure, respectively. The map , where is the Borel algebra, is an valued random variable. Let denote either a sequence of deterministic points in or a sequence of valued random variables, which will be clear from the context, and let denote the th coordinate of . In addition, we let denote the smallest algebra generated by . “Almost surely” is abbreviated as “a.s.”, and the product space of is denoted by . We use and for strong convergence and weak convergence, respectively.
We define as the set of fixed points of operator , and, in the product space, we let .
Definition 1
An operator is Lipschitz, where , if it satisfies , . In particular, is nonexpansive if , and contractive if .
Definition 2
Consider an operator .

is averaged with , if there is a nonexpansive operator such that , where is the identity operator.

is cocoercive with , if

is strongly monotone, where , if it satisfies When the inequality holds for , is monotone.

is quasistrongly monotone, where , if it satisfies . When the inequality holds for , is quasimonotone.

is demicompact petryshyn1966construction at if for every bounded sequence in such that , there exists a strongly convergent subsequence.
Averaged operators are nonexpansive. By the CauchySchwarz inequality, a cocoercive operator is Lipschitz; the converse is generally untrue, but true for the gradients of convex differentiable functions. Examples are given in the next section.
2 Applications
In this section, we provide some applications that are special cases of the fixedpoint problem (1). For each application, we identify its nonexpansive operator (or the corresponding operator ) and implement the conditions in Theorem 1.1. For simplicity, we use the uniform distribution, , and apply the simpler update (4) instead of (3).
2.1 Solving linear equations
Consider the linear system where is a nonsingular matrix with nonzero diagonal entries. Let , where and are the diagonal and offdiagonal parts of , respectively. Let and . Then the system is equivalent to the fixedpoint problem , where is nonexpansive if the spectral norm satisfies . The iteration is widely known as the Jacobi algorithm. Let . Each update involves multiplying just the th row of to and adding the th entry of , so we arrive at the following algorithm.
Proposition 1
(bauschke2011convex, , Example 22.5) Suppose that is Lipschitz continuous with . Then, is strongly monotone.
2.2 Minimize convex smooth function
Consider the optimization problem
(7) 
where is a closed proper convex differentiable function and is Lipschitz continuous, . Let . As is convex and differentiable, is a minimizer of if and only if is a zero of . Note that is cocoercive. By Lemma 1, is nonexpansive. Applying ARock, we have the following iteration:
(8) 
where . Note that needs a structure that makes it cheap to compute . Let us give two such examples: (i) quadratic programming: , where and only depends on a part of and ; (ii) sum of sparsely supported functions: and , where each depends on just a few variables.
Theorem 3.2 below guarantees the convergence of if . In addition, If is restricted strongly convex, namely, for any and , where is the solution set to (7), we have for some , then is quasistrongly monotone with modulus . According to Theorem 3.3, iteration (8) converges at a linear rate if the step size meets the condition therein.
Our convergence and rates are given in term of the distance to the solution set . In comparison, the results in the work liu2013asynchronous are given in terms of objective error under the assumption of a uniformly bounded . In addition, their step size decays like for some depending on , and our is better. Under similar assumptions, Bertsekas and Tsitsiklis (bertsekas1989parallel, , Section 7.5) also describes an algorithm for (7) and proves only subsequence convergence (bertsekas1989parallel, , Proposition 5.3) in .
2.3 Decentralized consensus optimization
Consider that agents in a connected network solve the consensus problem of minimizing , where is the shared variable and the convex differentiable function is held privately by agent . We assume that is Lipschitz continuous for all . A decentralized gradient descent algorithm nedic2009distributed can be developed based on the equivalent formulation
(9) 
where and is the socalled mixing matrix satisfying: if and only if . For , if , then agent can communicate with agent ; otherwise they cannot. We assume that is symmetric and doubly stochastic. Then, the decentralized consensus algorithm nedic2009distributed can be expressed as , where is a matrix with its th row equal to ; see yuan2013convergence . The computation of involves communication between agents, and is independently computed by each agent . The iteration is equivalent to the gradient descent iteration applied to . To apply our algorithm, we let
If the agents are independent Poisson processes and that each agent has activation rate , then the probability that agent activates before other agents is equal to larson1981urban and therefore our random sample scheme holds and ARock applies naturally. The algorithm is summarized as follows:
2.4 Minimize smooth nonsmooth functions
Consider the problem
(10) 
where is closed proper convex and is convex and Lipschitz differentiable with . Problems in the form of (10) arise in statistical regression, machine learning, and signal processing and include wellknown problems such as the support vector machine, regularized leastsquares, and regularized logistic regression. For any and scalar , define the proximal operator and the reflectiveproximal operator as
(11) 
respectively, and define the following forwardbackward operator Because is averaged and is averaged, is averaged for (bauschke2011convex, , Propositions 4.32 and 4.33). Define . When we apply Algorithm 1 to to solve (10), and assume is separable in all coordinates, that is, , the update for the th selected coordinate is
(12) 
Examples of separable functions include norm, norm square, the Huber function, and the indicator function of box constraints, i.e., . They all have simple maps. If , then the convergence is guaranteed by Theorem 3.2. To show linear convergence, we need to assume that is strongly convex. Then, Proposition 2 below shows that is a quasicontractive operator, and by Proposition 1, operator is quasistrongly monotone. Finally, linear convergence and its rate follow from Theorem 3.3.
Proposition 2
Assume that is a closed proper convex function, and is Lipschitz differentiable and strongly convex with modulus . Let . Then, both and are quasicontractive operators.
Proof
We first show that is a quasicontractive operator. Note
where the first inequality follows from the BaillonHaddad theorem^{4}^{4}4Let be a convex differentiable function. Then, is Lipschitz if and only if it is cocoercive. and the second one from the strong convexity of . Hence, is quasicontractive if . Since is convex, is firmly nonexpansive, and thus we immediately have the quasicontractiveness of from that of .
2.5 Minimize nonsmooth nonsmooth functions
Consider
(13) 
where both and are closed proper convex and their maps are easy to compute. Define the PeacemanRachford lions1979splitting operator:
Since both and are nonexpansive, their composition is also nonexpansive. Let When applying ARock to to solve problem (13), the update (6) reduces to:
(14) 
where we use instead of since the limit of is not a solution to (13); instead, a solution must be recovered via . The convergence follows from Theorem 3.2 and that is nonexpansive. If either or is strongly convex, then is contractive and thus by Theorem 3.3, ARock converges linearly. Finer convergence rates follow from DavisYin2014 ; DavisYin2014c . A naive implementation of (14) is
(15a)  
(15b)  
(15c) 
where and are intermediate variables. Note that the order in which the proximal operators are applied to and affects both YanYin2014 and whether coordinatewise updates can be efficiently computed. Next, we present two special cases of (13) in Subsections 2.5.1 and 2.6 and discuss how to efficiently implement the update (15).
2.5.1 Feasibility problem
Suppose that are closed convex subsets of with a nonempty intersection. The problem is to find a point in the intersection. Let be the indicator function of the set , that is, if and otherwise. The feasibility problem can be formulated as the following
Let , , and . We can implement (15) as follows (see Appendix A for the stepbystep derivation):
(16a)  
(16b)  
(16c) 
The update (16) can be implemented as follows. Let global memory hold , as well as . At the th update, an agent independently generates a random number , then reads as and as , and finally computes and updates in global memory according to (16). Since is maintained in global memory, the agent updates according to . This implementation saves each agent from computing (16a) or reading all . Each agent only reads and , executes (16b), and updates (16c) and .
2.6 Asyncparallel ADMM
This is another application of (15). Consider
(17) 
where and are Hilbert spaces, and are bounded linear operators. We apply the update (15) to the Lagrange dual of (17) (see gabay1983chapter for the derivation):
(18) 
where , , and and denote the convex conjugates of and , respectively. The proximal maps induced by and can be computed via solving subproblems that involve only the original terms in (17): can be computed by (see Appendix A for the derivation)
(19) 
and by