# Scalable FPGA Implementation of Dynamic Programming for Optimal Control of Hybrid Electrical Vehicles

Frans Skarman, Oscar Gustafsson

Linköping University

























Dynamic programming is not fast enough on CPUs



Dynamic programming is not fast enough on CPUs



A scalable FPGA architecture that makes real-time use possible

The dynamic programming algorithm



The dynamic programming algorithm



The dynamic programming algorithm



The dynamic programming algorithm



The dynamic programming algorithm



The dynamic programming algorithm



The dynamic programming algorithm



The dynamic programming algorithm



The dynamic programming algorithm



The dynamic programming algorithm



The dynamic programming algorithm



The dynamic programming algorithm



The dynamic programming algorithm



• 5.12 km Search horizon with 10 m steps

• 5.12 km Search horizon with 10 m steps

#### States

- 30 velocity steps
- 30 State of Charge (SOC) steps

• 5.12 km Search horizon with 10 m steps

#### States

- 30 velocity steps
- 30 State of Charge (SOC) steps

#### Inputs

- 30 steps of electric torque
- 30 steps of conbustion torque
- 6 gears





#### pprox 2 seconds for real time



#### pprox 2 seconds for real time

#### > 1 model execution every clock cycle

#### **Vehicle Model**



#### **Vehicle Model**



#### C++ model converted with HLS

Pipelined with Initiation Interval = 1











• 2D linear interpolation



- 2D linear interpolation
- Requires 4 memory accesses per value



- 2D linear interpolation
- Requires 4 memory accesses per value
- Simultaneous writeback is required

### **Memory Read Partitioning**



### **Memory Read Partitioning**



- 4 separate memories
- x, y evenness determines indexing

### **Memory Write Partitioning**



11











#### **Schedule and Performance**



#### **Schedule and Performance**



- Implemented in Spade HDL Implemented
- Tools:
  - Vitis HLS 2022.1
  - Vivado 2022.1
  - AMD Virtex UltraScale+ xcvu13pfhga2104-3-e

| EUs | $f_{ m max}$ (MHz) | CCs required      | Run time (s) | Speedup      | CLB   | DSP  | BRAM | URAM |
|-----|--------------------|-------------------|--------------|--------------|-------|------|------|------|
| 1   | 370                | $2.488\cdot 10^9$ | 6.73         | $18 \times$  | 10196 | 515  | 238  | 154  |
| 2   | 336                | $1.244\cdot 10^9$ | 3.70         | 34 	imes     | 21384 | 1030 | 476  | 154  |
| 4   | 335                | $6.220\cdot 10^8$ | 1.85         | $68 \times$  | 41654 | 2060 | 952  | 154  |
| 6   | 288                | $4.147\cdot 10^8$ | 1.44         | $87 \times$  | 62627 | 3090 | 1428 | 154  |
| 9   | 272                | $2.764\cdot 10^8$ | 1.02         | $123 \times$ | 94874 | 4635 | 2142 | 154  |

| EUs | $f_{ m max}$ (MHz) | CCs required       | Run time (s) | Speedup      | CLB   | DSP  | BRAM | URAM |
|-----|--------------------|--------------------|--------------|--------------|-------|------|------|------|
| 1   | 370                | $2.488\cdot 10^9$  | 6.73         | $18 \times$  | 10196 | 515  | 238  | 154  |
| 2   | 336                | $1.244\cdot 10^9$  | 3.70         | 34 	imes     | 21384 | 1030 | 476  | 154  |
| 4   | 335                | $6.220 \cdot 10^8$ | 1.85         | $68 \times$  | 41654 | 2060 | 952  | 154  |
| 6   | 288                | $4.147\cdot 10^8$  | 1.44         | 87 	imes     | 62627 | 3090 | 1428 | 154  |
| 9   | 272                | $2.764\cdot 10^8$  | 1.02         | $123 \times$ | 94874 | 4635 | 2142 | 154  |

| EUs | $f_{ m max}$ (MHz) | CCs required      | <b>Run time</b> (s) | Speedup      | CLB   | DSP  | BRAM | URAM |
|-----|--------------------|-------------------|---------------------|--------------|-------|------|------|------|
| 1   | 370                | $2.488\cdot 10^9$ | 6.73                | $18 \times$  | 10196 | 515  | 238  | 154  |
| 2   | 336                | $1.244\cdot 10^9$ | 3.70                | 34 	imes     | 21384 | 1030 | 476  | 154  |
| 4   | 335                | $6.220\cdot 10^8$ | 1.85                | 68 	imes     | 41654 | 2060 | 952  | 154  |
| 6   | 288                | $4.147\cdot 10^8$ | 1.44                | 87 	imes     | 62627 | 3090 | 1428 | 154  |
| 9   | 272                | $2.764\cdot 10^8$ | 1.02                | $123 \times$ | 94874 | 4635 | 2142 | 154  |

| EUs | $f_{ m max}$ (MHz) | CCs required      | Run time (s) | Speedup      | CLB   | DSP  | BRAM | URAM |
|-----|--------------------|-------------------|--------------|--------------|-------|------|------|------|
| 1   | 370                | $2.488\cdot 10^9$ | 6.73         | $18 \times$  | 10196 | 515  | 238  | 154  |
| 2   | 336                | $1.244\cdot 10^9$ | 3.70         | $34 \times$  | 21384 | 1030 | 476  | 154  |
| 4   | 335                | $6.220\cdot 10^8$ | 1.85         | 68 	imes     | 41654 | 2060 | 952  | 154  |
| 6   | 288                | $4.147\cdot 10^8$ | 1.44         | 87 	imes     | 62627 | 3090 | 1428 | 154  |
| 9   | 272                | $2.764\cdot 10^8$ | 1.02         | $123 \times$ | 94874 | 4635 | 2142 | 154  |

Single threaded Xeon W-1250: **126 seconds** 

| EUs | $f_{ m max}$ (MHz) | CCs required      | <b>Run time</b> (s) | Speedup      | CLB   | DSP  | BRAM | URAM |
|-----|--------------------|-------------------|---------------------|--------------|-------|------|------|------|
| 1   | 370                | $2.488\cdot 10^9$ | 6.73                | $18 \times$  | 10196 | 515  | 238  | 154  |
| 2   | 336                | $1.244\cdot 10^9$ | 3.70                | $34 \times$  | 21384 | 1030 | 476  | 154  |
| 4   | 335                | $6.220\cdot 10^8$ | 1.85                | $68 \times$  | 41654 | 2060 | 952  | 154  |
| 6   | 288                | $4.147\cdot 10^8$ | 1.44                | $87 \times$  | 62627 | 3090 | 1428 | 154  |
| 9   | 272                | $2.764\cdot 10^8$ | 1.02                | $123 \times$ | 94874 | 4635 | 2142 | 154  |

Single threaded Xeon W-1250: **126 seconds** 

| EUs | $f_{ m max}$ (MHz) | <b>CCs required</b> | <b>Run time</b> (s) | Speedup      | CLB   | DSP  | BRAM | URAM |
|-----|--------------------|---------------------|---------------------|--------------|-------|------|------|------|
| 1   | 370                | $2.488\cdot 10^9$   | 6.73                | $18 \times$  | 10196 | 515  | 238  | 154  |
| 2   | 336                | $1.244\cdot 10^9$   | 3.70                | 34 	imes     | 21384 | 1030 | 476  | 154  |
| 4   | 335                | $6.220 \cdot 10^8$  | 1.85                | $68 \times$  | 41654 | 2060 | 952  | 154  |
| 6   | 288                | $4.147\cdot 10^8$   | 1.44                | 87 	imes     | 62627 | 3090 | 1428 | 154  |
| 9   | 272                | $2.764\cdot 10^8$   | 1.02                | $123 \times$ | 94874 | 4635 | 2142 | 154  |

- CLB, DSP, BRAM increase linearly
- URAM only stores final cost-to-go map
- Vehicle model dominates resource usage and  $f_{\mathrm{max}}$

### **Conclusions & Contributions**

- Scalable architecture for hybrid electric vehicle optimization
- Enabling **real-time** use
- $> 100 \times speedup$  over CPU implementation

frans.skarman@liu.se

