# A power-tunable algorithm to compute single-source shortest paths

Sara Karamati · Jeffrey Young · Richard (Rich) Vuduc

November 17, 2017 — University of Colorado at Colorado Springs



hpcgarage.org/uccs





#### **ACM Doctoral Dissertation Award Winner (1985)**





http://www.amazon.com/Connection-Machine-Artificial-Intelligence/dp/0262580977/ref=sr\_1\_1?ie=UTF8&qid=1319571691&sr=8-1
http://www.technologyreview.com/files/1158/1106\_Q-A.tif.jpg

# The Connection Machine Daniel Hillis

#### hpcgarage.org/memsys16



#### **ACM Doctoral Dissertation Award Winner (1985)**







| ures and Their Relationship to Physics |     |  |  |  |  |
|----------------------------------------|-----|--|--|--|--|
| ce is No Good                          | 137 |  |  |  |  |
| is No Good                             | 137 |  |  |  |  |
| ysics                                  | 139 |  |  |  |  |
| of Computation                         | 142 |  |  |  |  |
|                                        | 144 |  |  |  |  |
|                                        | 145 |  |  |  |  |
|                                        | 173 |  |  |  |  |
|                                        |     |  |  |  |  |



# What physical constraint limits speed today?

hpcgarage.org/uccs



# What physical constraint limits speed today?







# COMPETITOR A Power

QUALCOMM SNAPDRAGON™S4 PROCESSOR



(An aside on the relationship between computational performance and power)



# ImageNet Dataset



Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]

# IM **GENET**





### (4 GPUs) x (250 Watts / GPU) x (1 week) ~ 0.6 billion Joules







### (4 GPUs) x (250 Watts / GPU) x (1 week) ~ 0.6 billion Joules

# (1 brain) x (1 year)

# x (20 Watts / brain) ~ 0.6 billion Joules 71223832

#### gettyimages\* Roderick Chen





### (4 GPUs) x (250 Watts / GPU) x (1 week) ~ 0.6 billion Joules

22

(1 brain) x (1 year)

x (20 Watts / brain) ~ 0.6 billion Joules 71223832 gettyimages\* Roderick Chen













|         | Jetson TK1                              |
|---------|-----------------------------------------|
|         | ARM A15<br>(32-bit, 2.3 GHz, 4+1 cores) |
|         | 192 core Kepler, 326 GF/s (per          |
| ory     | 2 GB LPDDR3                             |
| ge      | 16 GB eMMC                              |
| orking  | Ethernet                                |
| Factor  | Dev board                               |
|         | USB, HDMI, Serial                       |
| se Date | 2014                                    |

| 2341 | 2431 | 2521 | 2611 | 2701 | 2791 | 2881 | 2971 | 3061 | 3151 | 3241 | 3331 | 3421 | 3511 | 3601 |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|



















#### hpcgarage.org/uccs









| 2341<br>2431 | 2521 | 2611 | 2701 | 2791 | 2881 | 2971 | 3061 | 3151 | 3241 | 3331 | 3421 | 3511 | 3601 | 3691 | 3781 | 3871 |
|--------------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
|--------------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|





















|4

Main question of this talk:

# Can you design an algorithm in a way that you can control its power?

We are interested in "algorithmic" methods that complement techniques available in hardware, like DVFS, and systems software or middleware.

#### hpcgarage.org/uccs







# A first principle:

# Relationships among time, energy, and power.

J. Choi, D. Bedard, R. Fowler, R. Vuduc. "A roofline model of energy." In IPDPS'13. J. Choi, M. Dukhan, X. Liu, R. Vuduc. "Algorithmic time, energy, and power on candidate HPC building blocks." In IPDPS'14.

#### hpcgarage.org/uccs







#### Time ~ ?

Energy ~ ?

Power = Energy / Time

#### hpcgarage.org/uccs



*Time ~ (# of operations) / (number of processors)* Energy ~ (# of operations) *Power = Energy / Time ~ (number of processors) = Speedup* 

#### hpcgarage.org/uccs



# Time ~ (# of operations) / (number of processors)

Energy ~ (# of operations)

Power = Energy / Time ~ (number of processors) = Speedup



*Time ~ (# of operations) / (number of processors)* 

# **Energy** ~ (# of operations)

Power = Energy / Time ~ (number of processors) = Speedup

#### hpcgarage.org/uccs



# *Time ~ (# of operations) / (number of processors)*

Energy ~ (# of operations)

# Power = Energy / Time ~ (number of processors) = Speedup

#### hpcgarage.org/uccs





*Time ~ (# of operations) / (number of processors)* Energy ~ (# of operations) Power = Energy / Time ~ (number of processors) = Speedup

Conclusion: To save time & energy: Must reduce work (# ops or cost/op) To save power: Must slow down (e.g., use fewer cores)

#### hpcgarage.org/uccs



Time ~ (# of operations) / (number of processors) Energy ~ (# of operations) Power = Energy / Time ~ (number of processors) = Speedup

onclusion: To save power: Must slow down (e.g., use fewer cores)

# To save time & energy: Must reduce work (# ops or cost/op)

hpcgarage.org/uccs









onclusion: To save time & energy: Must reduce work (# ops or cost/op) To save power: Must slow down (e.g., use fewer cores)

Power = Energy / Time ~ (number of processors) = Speedup

Energy ~ (# of operations)

*Time ~ (# of operations) / (number of processors)* 





Relative power





Main question of this talk:

# Can you design an algorithm in a way that you can control its power?

We are interested in "algorithmic" methods that complement techniques available in hardware, like DVFS, and systems software or middleware.

#### hpcgarage.org/uccs









# Yes! Example: A power-tunable graph algorithm to compute single-source shortest paths (SSSP).

Sara Karamati (Ph.D. student), Dr. Jeff Young (research faculty), R. Vuduc – new, unpublished work

#### hpcgarage.org/uccs



### Baseline: Fastest, work-efficient "delta-stepping-like" method\*

- **Tunable work-parallelism tradeoff**
- Tuned for a GPU and run on an NVIDIA Jetson TK1, which has tunable core frequencies (10x) and memory frequencies (3x)
- No preprocessing shortcuts, a la PHAST\*\*

\* Based on GunRock implementations of Davidson, Baxter, Garland, and Owens (IPDPS'14) \*\* Delling et al. "PHAST: Hardware-accelerated shortest path trees" (JPDC'10)

#### hpcgarage.org/uccs

|              | Jetson TK1                                     |
|--------------|------------------------------------------------|
| CPU          | ARM A15<br>( <i>32-bit, 2.3 GHz, 4+1 cores</i> |
| GPU          | 192 core Kepler, 326 GF/s (pe                  |
| Memory       | 2 GB LPDDR3                                    |
| Storage      | 16 GB eMMC                                     |
| Networking   | Ethernet                                       |
| Form Factor  | Dev board                                      |
| I/O          | USB, HDMI, Serial                              |
| Release Date | 2014                                           |













hpcgarage.org/uccs





hpcgarage.org/uccs





hpcgarage.org/uccs





hpcgarage.org/uccs




hpcgarage.org/uccs









































# **Power ~ Parallelism ~ Queue sizes**



# **Power ~ Parallelism ~ Queue sizes**

# What is the effect of delta $(\delta)$ ?

hpcgarage.org/uccs



# What is the effect of delta $(\delta)$ ?



hpcgarage.org/uccs

## (road network)



# What is the effect of delta $(\delta)$ ?











### (road network)

## Mean power

## Max power

# delta: 8





## Time

### (scale-free)

## Mean power

### Max power



Observation:

# Delta ( $\delta$ ) is a tuning parameter that can be used to control power-time tradeoffs.

But how to choose it? It is input-dependent. And, in the literature, it is always treated as a fixed a priori parameter with little guidance on its ideal value.









Sara's insight:

# Treat $\delta$ as a parameter to be learned and controlled, dynamically.

### hpcgarage.org/uccs





# Recall: Near+Far == stages.







# Recall: Near+Far == stages.

# Intermediate frontier (queue) sizes







# Simple models between stages...



 $X^2(k) \sim (degree) * X^1(k)$ 

hpcgarage.org/uccs

### Try to estimate α ("learn") [∆(k) $X^{3}(k)$ $X^4(k)$ Bisectrebalancer frontier







 $\hat{X}_k^{(2)} = d \cdot X_k^{(1)}$ 





### hpcgarage.org/uccs





### hpcgarage.org/uccs



















Regularize to stabilize the estimator during the early iterations and use online fitting method (e.g., stochastic gradient descent)

hpcgarage.org/uccs













# Constrain: $\leq P$



# Largest queue







# Constrain: $\leq P$

### X<sup>3</sup>(k)

# Largest queue



 $\hat{X}_{k+1}^{(1)} \leq \frac{P}{-1}$ 



## Estimate the effect of a change in $\delta$

 $\hat{X}_{k+1}^{(1)} = X_k^{(4)} + \alpha \cdot \Delta \delta_k$ 







## Estimate the effect of a change in $\delta$



# Estimator






### Estimate the effect of a change in $\delta$



### Parameter







Cool feature:

# User sets P, the max parallelism, not $\delta!$

The controller chooses the hard-to-set  $\delta$  fully automatically, adjusting it as frequently as every iteration. Not discussed: One can prove that the controller is bounded-input bounded-output (BIBO) stable, meaning frontier sizes will be "well-behaved."







Next question:

# **Does it work?**

(Experimental results)

#### hpcgarage.org/uccs





























hpcgarage.org/uccs

### **Baseline**: **High variance**

Near+Fa





hpcgarage.org/uccs

### **Baseline**: **High variance**

Near+Fa



### Self-tuning controller: **Tight distributions at a** desired target



hpcgarage.org/uccs

### Baseline: **High variance**

Near+Fa



Next question:

# Can it save power or improve performance?

Compare against the baseline with a fixed  $\delta$  and hardware-based dynamic frequency scaling.

#### hpcgarage.org/uccs

















### (x=1, y=1): **Baseline near+far** & auto-DFS

1.1

















### But even when it can't save energy, it trades time for power gracefully. (Shown here: scale-free network)

1.1 1.2 Average Power





Limitation 1 (open-question):

# Choosing P is not the same as asking for max power, which was our motivation.

There are limits on dynamic power measurement, which is needed to provide feedback to this scheme. But it would likely be easy to incorporate because of the model-based approach.





Limitation 2 (observational):

# Power and energy savings are not big.

We observed ~ 40% speedups and ~ 15% reductions in maximum power consumption over hardware-only DFS. These savings cannot be bigger because the system baseline or "constant" power is high — it is upwards of 50% or more of maximum power, so the amount of dynamically controllable power is small.







Conclusions:

The "dynamic SSSP" algorithm improves on near+far, even SSSP easier to use.

The control-based scheme can be applied to any algorithm that is a "sequence of (filter) banks" (e.g., others in Gunrock), though the models may need specialization.

Energy savings are possible, especially when combined with hardware-only techniques like DFS.

hpcgarage.org/uccs

# when ignoring power. It's easier to choose P than $\delta$ , making this











## **Rich Vuduc**

#### High-performance computing, scalable parallel algorithms & software



**Computational Science and Engineering** 

hpcgarage.org/uccs

# ( hpcgarage



