Sunday, January 16, 2011

Collecting Max Items in Python

Lately, I've needed a way to collect a running list of the top X values and their associated items in Python. This is useful if you'd like to know such things as:
  • Top 100 price gainers in your price series database;
  • Top 10 most volatile stocks in your price series database;
  • Top 5 longest running batch jobs in your operations arena;
  • Any many more...
Here's the MaxItems code to do the job:
import heapq

class MaxItems(object):
    """
    Caches the max X items of whatever you're analyzing and
    returns a list containing only those max X values and
    associated items.
    """
    def __init__(self, size):
        self.size = size
        self._recs = []

    def push(self, value, items):
        if len(self._recs) < self.size:
            heapq.heappush(self._recs, (value, items))

        else:
            minitem = heapq.heappushpop(self._recs, (value, items))

    def items(self):
        return heapq.nlargest(self.size, self._recs)
Example call and results:
pricegains = []
pricegains.append({'symbol':'GOOG', 'gain':234.0})
pricegains.append({'symbol':'YHOO', 'gain':124.0})
pricegains.append({'symbol':'IBM', 'gain':1242.0})
pricegains.append({'symbol':'GE', 'gain':1800.0})
pricegains.append({'symbol':'ENE', 'gain':0.0})
pricegains.append({'symbol':'KREM', 'gain':12.0})
maxitems = MaxItems(3)

for row in pricegains:
    maxitems.push(row['gain'], row)

print maxitems.items()

----------------------------------------------------------
Results of call:
(1800.0, {'symbol': 'GE', 'gain': 1800.0})
(1242.0, {'symbol': 'IBM', 'gain': 1242.0})
(234.0, {'symbol': 'GOOG', 'gain': 234.0})
The heapq module works nicely in accomplishing the task. What's ironic is Python's heapq module implements the min-heap algorithm which works out nicely and efficiently in determining the maximum values over a list. But, does not work out so efficiently for determining the minimum values over a list.

I'll cover the MinItems class in another post. But, to give you a hint of what does work in collecting the minimum values over a list is one of the alternatives I explored in building the MaxItems class...

Alternative yet Inefficient version of MaxItems:
import bisect

class MaxItems2(object):
    """
    Caches the max X items of whatever you're analyzing and
    returns a list containing only those max X values and
    associated items.
    """
    def __init__(self, size):
        self.size = size
        self._recs = []

    def push(self, value, items):
        if len(self._recs) < self.size:
            bisect.insort(self._recs, (value, items))

        elif bisect.bisect(self._recs, (value, items)) > self.size:
            bisect.insort(self._recs, (value, items))
            minitem = self._recs.pop(0)

    def items(self):
        return sorted(self._recs, reverse=True)
MaxItems2 takes advantage of the bisect module and while it works great; performance is at a minimum 2x worse on average than using the heapq method.
Test Code:
import random

pricegains = []
maxitems = MaxItems(100)
for x in xrange(500000):
    gain = random.uniform(1.0,500.0)
    maxitems.push(gain, ('This', 'is', 'Record'))

rows = maxitems.items()
Calling the above code with the wonderful timeit module produced the following results:
  • heapq method: Ten iterations finished in 1.90 seconds.
  • bisect method: Ten iterations finished in 3.80 seconds.
If you know of a faster way to collect the top x of a group of items...please share.
MT

No comments: