뉴턴방법 (Newton’s Method)

AI 수학/최적화

뉴턴방법 (Newton’s Method)

깊은대학 2022. 4. 8. 15:47

경사하강법(gradient descent)이 어떤 함수의 최소값을 향한 방향을 계산하는데 1차 미분을 사용하는 반면 뉴턴방법(Newton's method)는 2차 미분을 사용한다. 따라서 뉴턴방법이 경사하강법보다는 성능이 훨씬 좋다.

제약조건이 없는 최적화 문제는 다음과 같다.

\[ \min_{\mathbf{x}} f(\mathbf{x}) \tag{1} \]

여기서 \(\mathbf{x} \in \mathbb{R}^n\) 은 최적화 변수이고, \(f(\mathbf{x})\) 는 목적함수(objective function)이다. 목적함수는 두 번 미분가능하다고 가정한다.

뉴턴방법의 기본 개념은 최적화 변수의 시작값(starting point) \(\mathbf{x}\) 에서 목적함수 \(f(\mathbf{x})\) 를 2차 함수로 근사화하고, 목적함수 대신에 이 2차 근사 함수를 최소화하는 것이다. 그리고 이 최소값을 다음 이터레이션(iteration) 스텝의 시작값으로 설정하여 위 절차를 반복한다.

목적함수 \(f(\mathbf{x})\) 를 테일러 시리즈(Taylor series)로 전개하고 2차항에서 절삭하면 다음과 같다.

\[ \begin{align} f(\mathbf{x}+ \Delta \mathbf{x}) & \approx f(\mathbf{x})+ \nabla_{\mathbf{x}}^T f(\mathbf{x}) \Delta \mathbf{x}+ \frac{1}{2} \Delta \mathbf{x}^T \nabla_{\mathbf{x}}^2 f(\mathbf{x}) \Delta \mathbf{x} \tag{2} \\ \\ &= \hat{f}(\mathbf{x}+ \Delta \mathbf{x}) \end{align} \]

여기서 \(\nabla_{\mathbf{x}} f(\mathbf{x})\) 는 \(f(\mathbf{x})\) 의 그래디언트(gradient)이고 \(\nabla_{\mathbf{x}}^2 f(\mathbf{x})\) 는 헤시안(Hessian)이다. 2차 근사함수 \(\hat{f}(\mathbf{x})\) 의 최소값을 구하기 위하여 미분하면 다음과 같다.

\[ \nabla_{\Delta \mathbf{x}} \hat{f}(\mathbf{x}+ \Delta \mathbf{x})=0= \nabla_{\mathbf{x}} f(\mathbf{x})+ \nabla_{\mathbf{x}}^2 f(\mathbf{x}) \Delta \mathbf{x} \tag{3} \]

따라서 \(\nabla_{\mathbf{x}}^2 f(\mathbf{x}) \gt 0\) 이라면 \( \hat{f} (\mathbf{x}+ \Delta \mathbf{x})\) 를 최소화하는 \(\Delta \mathbf{x}\) 는 다음과 같다.

\[ \Delta \mathbf{x}= - \left( \nabla_{\mathbf{x}}^2 f(\mathbf{x}) \right)^{-1} \nabla_{\mathbf{x}} f(\mathbf{x}) \tag{4} \]

여기서 \(\Delta \mathbf{x}\) 를 뉴턴스텝(Newton step) 이라고 한다. 결국 \(\hat{f}(\mathbf{x})\) 의 최소값은 \(\mathbf{x}+\Delta \mathbf{x}\) 에서 얻을 수 있으며, 이 값은 다음 이터레이션의 시작값으로 설정된다.

\[ \mathbf{x} \gets \mathbf{x} - \left( \nabla_{\mathbf{x}}^2 f(\mathbf{x}) \right)^{-1} \nabla_{\mathbf{x}} f(\mathbf{x}) \tag{5} \]

뉴턴스텝 \(\Delta \mathbf{x}\) 는 \(\nabla_{\mathbf{x}}^2 f(\mathbf{x}) \gt 0 \) 이고 \(\nabla_{\mathbf{x}} f(\mathbf{x}) \ne 0\) 이라면 목적함수의 하강 방향(descent direction)을 가리킨다. 즉 \(\Delta \mathbf{x}\) 는 \(f(\mathbf{x}+\eta \Delta \mathbf{x}) \lt f(\mathbf{x})\) 이 되도록 만드는 \(0\) 보다 큰 어떤 스칼라 값 \(\eta\) 가 존재한다. 증명은 다음과 같다.

먼저 \(\phi (\eta)=f(\mathbf{x}+\eta \Delta \mathbf{x})\) 로 놓고 \(\phi\) 를 \(\eta\) 로 미분한다.

\[ \phi^\prime (\eta)=\nabla_{\mathbf{x}}^T f(\mathbf{x})\Delta \mathbf{x} + \eta \Delta \mathbf{x}^T \nabla_{\mathbf{x}}^T f(\mathbf{x}) \Delta \mathbf{x} + \cdots \tag{6} \]

\(\eta=0\) 에서 위 미분을 계산하면 식 (3)과 (4)에 의해서 다음과 같이 된다.

\[ \begin{align} \phi^\prime (0) &= \nabla_{\mathbf{x}}^T f(\mathbf{x}) \Delta \mathbf{x} \tag{7} \\ \\ &= - \Delta \mathbf{x}^T \left( \nabla_{\mathbf{x}}^2 f(\mathbf{x}) \right)^{-1} \Delta \mathbf{x} \lt 0 \end{align} \]

식 (7)에 의하면 결국 \(\phi(\eta) \lt \phi(0)\) 이 되는 \(\eta \gt 0\) 이 존재한다는 것이며 이는 어떤 \(\eta \gt 0 \) 에 대해서 다음 부등식이 성립함을 의미한다.

\[ f(\mathbf{x}+ \eta \Delta \mathbf{x}) \lt f(\mathbf{x}) \tag{8} \]

그렇다면 뉴턴방법으로 이터레이션하면서 구한 \(f(\mathbf{x})\) 의 값과 식 (1)의 최적해 \(p^\star\) 의 차이는 어떻게 계산할 수 있을까. 이터레이션을 어딘가에서 멈추려면 이 값을 아는 것이 중요하다. \(f(\mathbf{x})-p^\star\) 는 식 (2)와 (4)를 이용하여 다음과 같이 근사적으로 계산할 수 있다.

\[ \begin{align} f(\mathbf{x})-p^\star & \approx f(\mathbf{x})-f(\mathbf{x}+\Delta \mathbf{x}) \tag{9} \\ \\ &= - \nabla_{\mathbf{x}}^T f(\mathbf{x}) \Delta \mathbf{x}-\frac{1}{2} \Delta \mathbf{x}^T \nabla_{\mathbf{x}}^2 f(\mathbf{x}) \Delta \mathbf{x} \\ \\ &= \frac{1}{2} \Delta \mathbf{x}^T \nabla_{\mathbf{x}}^2 f(\mathbf{x}) \Delta \mathbf{x} \\ \\ &= \frac{1}{2} \nabla_{\mathbf{x}}^T f(\mathbf{x}) \left( \nabla_{\mathbf{x}}^2 f(\mathbf{x}) \right)^{-1} \nabla_{\mathbf{x}} f(\mathbf{x}) \\ \\ &= \frac{1}{2} q^2 (\mathbf{x}) \end{align} \]

여기서 \(q (\mathbf{x})\) 를 뉴턴감소량(Newton decrement)이라고 한다.

\[ q (\mathbf{x}) = \left( \nabla_{\mathbf{x}}^T f(\mathbf{x}) \left( \nabla_{\mathbf{x}}^2 f(\mathbf{x}) \right)^{-1} \nabla_{\mathbf{x}} f(\mathbf{x}) \right)^{\frac{1}{2}} \tag{10} \]

뉴턴감소량이 일정 수준 이하로 작아지면 최적해에 거의 근접했다는 신호로 볼 수 있으므로 이테레이션을 멈출 수 있다.

뉴턴방법을 정리하면 다음과 같다.

   0. 시작값 \(\mathbf{x}\) 와 종료조건 값 \(\epsilon \) 을 설정한다.
   1. 다음을 반복한다.

         [1] 뉴턴스텝과 뉴턴감소량을 계산한다.
                 \( \Delta \mathbf{x} = - \left( \nabla_{\mathbf{x}}^2 f(\mathbf{x}) \right)^{-1} \nabla_{\mathbf{x}} f(\mathbf{x}) \)
                 \( q^2 (\mathbf{x})= \nabla_{\mathbf{x}}^T f(\mathbf{x}) \left( \nabla_{\mathbf{x}}^2 f(\mathbf{x}) \right)^{-1} \nabla_{\mathbf{x}} f(\mathbf{x}) \)
         [2] 뉴턴감소량을 종료조건 값 \(\epsilon\) 보다 작으면 이터레이션을 중지한다.
                \( \frac{1}{2} q^2 ( \mathbf{x}) \le \epsilon \)
         [3] 스텝사이즈 \(\eta\) 를 계산한다(백트래킹 라인서치 방법이나 미리 설정된 값 등 사용).
         [4] \(\mathbf{x}\) 를 업데이트 한다.
                 \( \mathbf{x} \gets \mathbf{x}+ \eta \Delta \mathbf{x} \)

뉴턴방법의 단점은 최적화 변수의 개수가 클 때 헤시안 행렬 \(\nabla_{\mathbf{x}}^2 f(\mathbf{x})\) 과 식 (4)를 계산하는데 많은 자원이 소요된다는 데에 있다. 몇가지 보완 알고리즘이 있는데 이에 대해서는 나중에 알아보도록 하자. 그리고 헤시안 행렬이 정정행렬(positive-definite matrix)이 아닌 경우에도 문제가 되는데, 이는 Levenberg-Marquardt 수정 알고리즘을 통해서 극복할 수 있다.