Heaps and Priority Queues

[ Last Updated: 2024-05-11 ]

Introduction: Since last week's look into Dijkstra's optimization involved heaps, I decided to dive deeper into data structures this week, starting with the Heap.

What is a Heap? Why use it?

A Heap is a tree-based data structure typically used to provide fast access to the maximum or minimum value in a collection of data. Heaps are usually constructed as Complete Binary Trees. A valid heap satisfies one of the following properties:

  • Min-Heap: The value of each node is less than or equal to the values of its children.
  • Max-Heap: The value of each node is greater than or equal to the values of its children.

Although we visualize a heap as a tree, its "complete binary tree" property allows it to be stored directly in a standard list. This is because indices in the list correspond directly to positions in the tree:

  • For a node at index ii (starting from 1), its children are at 2i2i and 2i+12i + 1.
  • Its parent is at i/2i/2 (integer division).

Because of this structure, search (for the extreme value), insertion, and deletion can be maintained at O(log2n)O(\log_2 n), which matches the height of the tree. This is a massive improvement over the O(n)O(n) time required to scan a simple list.

Basic Heap Operations

1. Insertion (Insert and Bubble Up) - O(logn)O(\log n)

  1. Insert: Add the new element to the end of the list (the bottom-most, right-most position in the tree).
  2. Bubble Up: Repeatedly swap the element with its parent until the heap property is restored.

Heap Insert

2. Extraction (Extract and Bubble Down) - O(logn)O(\log n)

  1. Extract: Remove the root (min/max value) and replace it with the last element in the heap.
  2. Bubble Down: Repeatedly push this "new root" down by swapping it with its smallest/largest child until the heap property is restored.

Heap Pop

3. Deletion - O(logn)O(\log n)

  1. Remove the node at index xx.
  2. Replace it with the very last element in the heap and perform a bubble down (similar to extraction).

4. Heapify - O(n)O(n)

  1. Throw the entire initial sequence into a list.
  2. Start from the first non-leaf node (n/2n/2) and work upwards, performing bubble down at each node. The total time complexity is i=1log2nin2i=O(n)\sum_{i=1}^{\log_2n} i \cdot \frac{n}{2^i} = O(n).

Note: Finding the extreme value is O(1)O(1)—just peek at the root!


Heapq in Python

Python provides the heapq module to implement min-heap operations effortlessly.

import heapq
import random
 
# Generate 10 random elements and heapify them
h = [random.randint(1, 100) for _ in range(10)]
print("Randomly generated list:", h)
 
heapq.heapify(h)
print("Min-Heap list:", h)
 
# Peek at the minimum
print("Minimum element:", h[0])
 
# Extract the minimum
min_val = heapq.heappop(h)
print("Popped element:", min_val)
print("Remaining heap:", h)
 
# Push a new random number
x = random.randint(1, 100)
heapq.heappush(h, x)
print("Heap after pushing new element:", h)

Note: heapq does not create a new class; it manages a standard list. You can start with an empty list heap = [] and use heapq.heappush(heap, item) to maintain heap properties automatically.


Practical Case: LeetCode 295. Find Median from Data Stream

Problem Description

The median is the middle value in an ordered list. If the size is even, it is the mean of the two middle values. Implement a class to add numbers from a stream and find the median efficiently.

Strategy

This can be solved efficiently using two heaps:

  1. A Max-Heap for the left half (values smaller than the median).

  2. A Min-Heap for the right half (values larger than the median).

Note: Since Python's heapq only supports min-heaps, we store negative values in the max-heap to simulate max-heap behavior.

Logic:

  • If the new number is smaller than the min of the right side, push to the left (max-heap). Otherwise, push to the right (min-heap).

  • Balance: Ensure the size difference between the two heaps is no more than 1.

  • Median: If sizes are unequal, the median is the top of the larger heap. If equal, it's the average of both tops.

Python

import heapq

class MedianFinder(object):
    def __init__(self):
        self.maxheap = [] # Left side (stored as negative)
        self.minheap = [] # Right side

    def addNum(self, num):
        if not self.minheap or num < self.minheap[0]:
            heapq.heappush(self.maxheap, -num)
        else:
            heapq.heappush(self.minheap, num)

        # Rebalance
        if len(self.maxheap) - len(self.minheap) > 1:
            heapq.heappush(self.minheap, -heapq.heappop(self.maxheap))
        elif len(self.minheap) - len(self.maxheap) > 1:
            heapq.heappush(self.maxheap, -heapq.heappop(self.minheap))  

    def findMedian(self):
        if not self.maxheap and not self.minheap:
            return None
        
        if len(self.maxheap) > len(self.minheap):
            return float(-self.maxheap[0])
        elif len(self.minheap) > len(self.maxheap):
            return float(self.minheap[0])
        else:
            return (-self.maxheap[0] + self.minheap[0]) / 2.0

Summary

Heaps are powerful for handling dynamic data and implementing priority queues. Their strength lies in rapid O(1)O(1) access to extremes and efficient O(logn)O(\log n) updates. In the real world, they are used for database maintenance, game event processing, and real-time task scheduling where priorities change constantly.

—Untitled penguin 2024/5/11 0:49