Lately, I've needed a way to collect a running list of the top X values and their associated items in Python. This is useful if you'd like to know such things as:
- Top 100 price gainers in your price series database;
- Top 10 most volatile stocks in your price series database;
- Top 5 longest running batch jobs in your operations arena;
- Any many more...
Here's the MaxItems code to do the job:
import heapq
class MaxItems(object):
"""
Caches the max X items of whatever you're analyzing and
returns a list containing only those max X values and
associated items.
"""
def __init__(self, size):
self.size = size
self._recs = []
def push(self, value, items):
if len(self._recs) < self.size:
heapq.heappush(self._recs, (value, items))
else:
minitem = heapq.heappushpop(self._recs, (value, items))
def items(self):
return heapq.nlargest(self.size, self._recs)
Example call and results:
pricegains = []
pricegains.append({'symbol':'GOOG', 'gain':234.0})
pricegains.append({'symbol':'YHOO', 'gain':124.0})
pricegains.append({'symbol':'IBM', 'gain':1242.0})
pricegains.append({'symbol':'GE', 'gain':1800.0})
pricegains.append({'symbol':'ENE', 'gain':0.0})
pricegains.append({'symbol':'KREM', 'gain':12.0})
maxitems = MaxItems(3)
for row in pricegains:
maxitems.push(row['gain'], row)
print maxitems.items()
----------------------------------------------------------
Results of call:
(1800.0, {'symbol': 'GE', 'gain': 1800.0})
(1242.0, {'symbol': 'IBM', 'gain': 1242.0})
(234.0, {'symbol': 'GOOG', 'gain': 234.0})
The
heapq module works nicely in accomplishing the task. What's ironic is Python's heapq module implements the min-heap algorithm which works out nicely and efficiently in determining the maximum values over a list. But, does not work out so efficiently for determining the minimum values over a list.
I'll cover the MinItems class in another post. But, to give you a hint of what does work in collecting the minimum values over a list is one of the alternatives I explored in building the MaxItems class...
Alternative yet Inefficient version of MaxItems:
import bisect
class MaxItems2(object):
"""
Caches the max X items of whatever you're analyzing and
returns a list containing only those max X values and
associated items.
"""
def __init__(self, size):
self.size = size
self._recs = []
def push(self, value, items):
if len(self._recs) < self.size:
bisect.insort(self._recs, (value, items))
elif bisect.bisect(self._recs, (value, items)) > self.size:
bisect.insort(self._recs, (value, items))
minitem = self._recs.pop(0)
def items(self):
return sorted(self._recs, reverse=True)
MaxItems2 takes advantage of the
bisect module and while it works great; performance is at a minimum 2x worse on average than using the heapq method.
Test Code:
import random
pricegains = []
maxitems = MaxItems(100)
for x in xrange(500000):
gain = random.uniform(1.0,500.0)
maxitems.push(gain, ('This', 'is', 'Record'))
rows = maxitems.items()
Calling the above code with the wonderful
timeit module produced the following results:
- heapq method: Ten iterations finished in 1.90 seconds.
- bisect method: Ten iterations finished in 3.80 seconds.
If you know of a faster way to collect the top x of a group of items...please share.
MT