of 22

# NIPS2014_ModelingNAG

group meeting presentation
Published on: Mar 3, 2016
Published in: Data & Analytics

#### Transcripts - NIPS2014_ModelingNAG

• 1. A Diﬀerential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 speaker: kv MCLab, CITI Academia Sinicas January 29, 2015 Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 1 / 22
• 2. Overview 1 Introduction 2 Derivation of the ODE 3 Equivalence between the ODE and Nesterov’s scheme 4 A family of generalized Nesterov’s schemes Continuous Optimization Composite Optimization 5 Accelerating to linear convergence by restarting Numerical examples 6 Discussion Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 2 / 22
• 3. Introduction: NAG Nesterov’s Accelerated Gradient Scheme xk = yk−1 − s f (yk−1) (1) yk = xk + k − 1 k + 2 (xk − xk−1) (2) For ﬁxed step size s = 1/L, where L is Lipschitz constant of f , this scheme exhibits the convergence rate f (xk) − f ∗ ≤ O( L||x0 − x∗||2 k2 ) This improvement relies on the introduction to momentum xk − xk−1 and the particularly tuned coeﬃcient (k − 1)/(k + 2) Theorem (L-Lipschitz Continuity) || f (x) − f (y)|| ≤ L||x − y|| Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 3 / 22
• 4. Introduction: Second order ODE Derive a second order ordinary diﬀerential equation(ODE), which is the exact limit of Nesterov’s scheme by taking small step size ¨X + 3 t ˙X + f (X) = 0 (3) for t > 0, with initial condition X(0) = x0, ˙X(0) = 0; here x0 is the starting point in Nesterov’s scheme. ˙X denotes to velocity and ¨X is acceleration. Time parameter in this ODE is related to step size t ∼ k √ s In this paper, FL denotes the class of convex function f with L-Lipschitz continuous gradients deﬁned on Rn; Sµ denotes the class of µ-strongly convex function f on Rn Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 4 / 22
• 5. Derivation of the ODE Assmue f ∈ FL for L > 0, combine two equations of (1) and applying rescaling give xk+1 − xk √ s = k − 1 k + 2 xk − xk−1 √ s − √ s f (yk) (4) Introducfelis e the ansatz xk ∼ X(k √ s) for smooth curve X(t) deﬁne for t > 0. With these approximations, we get Taylor expansions: (xk+1 − xk)/ √ s = ˙X(t) + 1 2 ¨X(t) √ s + o( √ s) (xk − xk−1)/ √ s = ˙X(t) − 1 2 ¨X(t) √ s + o( √ s) √ s f (yk) = √ s f (X(t)) + o( √ s) where the third equality we use yk − X(t) = o(1) Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 5 / 22
• 6. Derivation of the ODE The formula (4) can be rewritten as ˙X(t) + 1 2 ¨X(t) √ s + o( √ s) = (1 − 3 √ s t ){ ˙X(t) − 1 2 ¨X(t) √ s + o( √ s)} − √ s f (X(t)) + o( √ s) By comparing the coeﬃcients of √ s, we obtain ¨X + 3 t ˙X + f (X) = 0 Theorem (Well posed ODE, Existence and Uniqueness) For any f ∈ F∞ := UL>0FL and any x0 ∈ Rn, the ODE (3) with initial conditions X(0) = x0, ˙X(0) = 0 has an unique global solution X. Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 6 / 22
• 7. Equivalence between the ODE and Nesterov’s scheme We study the stable step size for numerically solving ODE. The ﬁnite diﬀerence approximation of (3) by the forward Euler method X(t + ∆t) − 2X(t) + X(t + ∆t) ∆t2 + 3 t X(t) − X(t − ∆t) ∆t + f (X(t)) = 0 which is equivalent to X(t + ∆t) = (2 − 3∆t t )X(t) − ∆t2 f (X(t)) − (1 − ∆t t )X(t − ∆t) Assuming that f is suﬃciently smooth, for small perturbation. The characteristic equation of this ﬁnite diﬀerence scheme is approximately (identify k = t/∆t) det{λ2 − (2 − ∆t2 2 f − 3∆t t )λ + 1 − 3∆t t } = 0 (5) For numerical stability, all the roots of (5) should lie on unit circle. Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 7 / 22
• 8. Equivalence between the ODE and Nesterov’s scheme Now, we exhibit approximate equivalence between the ODE and Nesterov’s scheme in terms of convergence rate. Theorem (Discrete Nesterov Scheme 3.1) For any f ∈ FL, the sequence {xk} in 1 with step size s ≤ 1/L obeys f (xk) − f ∗ ≤ 2||x0 − x2||2 s(k + 1)2 First result indicates the trajectory of ODE (3) closely resembles the sequence {xk} in terms of the convergence rate Theorem (Continuous ODE Scheme 3.2) For any f ∈ F∞, let X(t) be the unique global solution to (3) with initial conditions X(0) = x0, ˙X(0) = 0, for any t > 0 f (X(t)) − f ∗ ≤ 2||x0 − x∗||2 t2 Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 8 / 22
• 9. Proof of Theorem 3.2 Consider energy functional deﬁned as ε(t) := t2 (f (X(t)) − f ∗ ) + 2||X + t 2 ˙X − x∗ ||2 whose time derivative is ... Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 9 / 22
• 10. A family of generalized Nesterov’s schemes Exploit the power of ODE. We would be interested in studying the ODE (3) with the number of 3 appearing the coeﬃcient of ˙X/t replaced by a general constant r as ¨X + r t ˙X + f (X) = 0, X(0) = x0, ˙X(0) = 0 (6) Using the argument similar to theorem 2.1, this ODE is guaranteed to assume a unique global solution for any f ∈ F∞ Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 10 / 22
• 11. Generalized Nesterov’s Scheme: Continuous Optimization Theorem (4.1) Suppose r > 3 and let X be the unique solution to (6) for some f ∈ F∞. Then X(t) obeys f (X(t)) − f ∗ ≤ (r − 1)2||x0 − x∗||2 2t2 and ∞ 0 t(f (X(t)) − f ∗ )dt ≤ (r − 1)2||x0 − x∗||2 2(r − 3) Theorem (4.2) For any f ∈ Sµ,L(Rn), the unique solution X to (6) with r ≥ 9/2 obeys f (X(t)) − f ∗ ≤ Cr5/2||x0 − x∗||2 t3√ µ Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 11 / 22
• 12. Generalized Nesterov’s Scheme: Continuous Optimization For example, the solution to (6) with f (x) = ||x||2/2 is X(t) = 2 r−1 2 Γ((r + 1)/2)J(r−1)/2(t) t(r−1)/2 where J(r−1)/2(.) is the ﬁrst kind of Bessel function of order (r − 1)/2. For large t, this Bessel function obeys J(r−1)/2(t) = 2/(πt)(cos(t − rπ/4) + O(1/t). Hence f (X(t)) − f ∗ ≤ ||x0 − x∗ ||2 /tr Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 12 / 22
• 13. Generalized Nesterov’s Scheme: Composite Optimization Skip this part min x∈Rn f (x) = g(x) + h(x) where g ∈ FL for some L > 0 and h is convex on Rn with possible extended value ∞. Deﬁne proximal subgradient Gs(x) := x − arg minz{||z − (x − s g(x))||2/(2s) + h(z)} s Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 13 / 22
• 14. Accelerating to linear convergence by restarting This work provides a new restarting strategy, called speed restarting scheme. The underlying motivation is to maintain relatively high velocity ˙X along the trajectory. Deﬁnition 5.1 For ODE (3) with X(0) = x0, ˙X(0) = 0, let T = T(f , x0) = sup{t > 0 : ∀u ∈ (0, t), || ˙X(u)||2 du > 0} be the speed restarting time. In words, T is the ﬁrst time the velocity || ˙X|| decreases. Indeed, f (X(t)) is the decreasing function before time T, for t < T, df (X(t)) dt =< f (X), ˙X >= − 3 t || ˙X||2 − 1 2 || ˙X||2 dt < 0 Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 14 / 22
• 15. Accelerating to linear convergence by restarting The speed restarting ODE is thus ¨X(t) + 3 tsr ˙X(t) + f (X(t)) = 0 (7) where tsr is set to zero whenever < ˙X, ¨X >= 0. Theorem (5.2) There exists positive constants c1 and c2, which only depend on the condition number L/µ, such that for any f ∈ Sµ,L, we have f (Xsr (t)) − f (x∗ ) ≤ c1L||x0 − x∗||2 2 exp−c2t √ L The theorem guarantees linear convergence of solution to (7). This is new result in the literature. Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 15 / 22
• 16. Numerical examples Below we present a discrete analog to restarted scheme. Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 16 / 22
• 17. Quadratic Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 17 / 22
• 18. Log-sum-exp Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 18 / 22
• 19. Matrix compleltion Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 19 / 22
• 20. Lasso in l1-constrainted form with large space design Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 20 / 22
• 21. References W. Su, S. Boyd, E. Candes (2014) A Diﬀerential Equation for Modeling Nesterovs Accelerated Gradient Method: Theory and Insights NIPS 2014 Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 21 / 22
• 22. qs? Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014 (MCLab)ODE-NAG January 29, 2015 22 / 22