Trading Framework Part I: Tools I Use
Posted by Mike Taylor | Thursday, February 11, 2010
I received a question from a reader regarding the software I use...more specifically...the open source software I use in trading. Instead of responding directly, I figured the answer might be useful to other readers of this blog. My basic trading framework is the following:
| Operating System: | Windows Vista Home Premium |
| Programming Languages: | Python 2.6.2 & R 2.9.1 |
| Databases: | SQLite 2.4.1, Numpy 1.3.0, & CSV |
| Programming Editor: | SciTE 1.78 |
| Graphing Engines: | Matplotlib 0.98.5 & R |
| GUI: | HTML & JavaScript |
| Scheduler: | Windows Task Scheduler |
| Shells: | Command.com (DOS) & Cygwin (Bash) |
| Historical Quotes: | CSI & Yahoo Finance |
Operating System
Choosing Windows as the operating system is mainly out of convenience. As you can see above, the only real item that would prevent a full move to Linux is the historical quote provider, CSI. Everything else can run on another platform or a suitable alternative is available.
Another reason I've stayed with Windows is my current job (a Windows shop). But, I will admit, I have been very close to switching to a Mac the past few months, or possibly OpenSUSE. Just haven't taken the bite yet.
On a side note, prior to my current employer...I worked for a University that was really ahead of its time. Every program we developed had to pass a compatibility test, "Could it easily run on another platform?" While this at times was an impossible task due to user requirements...we still always coded with this compatibility in mind. And I've kept this same philosophy in developing the trading simulation engine.
Programming Languages
I'm originally a Cobol programmer. Yes, that's right...if you've never heard of one...now you're reading a blog by one. Cobol programmers, the good ones, are very keen on whitespace. When you're throwing a lot of code around...the whitespace is what keeps you sane. And so, when I was trying out the various scripting languages back in the day...Python really struck my fancy. I spent the better part of 9 years trying to force programmers to keep the code pretty in Cobol. Only to see Python come around and truly force programmers to code clean. Over the years, I have worked in various other languages, but I've always stuck with Python.
I think another reason I chose Python was due to WealthLab's Scripting language (Pascal-based). I felt I could build an environment similar to WealthLab that would offer the same scripting ease. So far, Python has done a great job in keeping the framework simple and extensible.
Another language I have used from time to time in my trading is R. I use R mainly to analyze trading results. A few years ago, I actually developed a prototype of the trading simulation engine in R. But it was too slow. The loops killed it. With the recent development of Revolution Computing's ParallelR...I've often wondered what the results would be now. But I'm past the point of no return with the engine in Python. Still, as far as fast analysis of CSV files goes...it is really hard to beat R.
Databases
I struggled several years with how to store and retrieve the historical price series data for the trading simulation engine. The main problem was the data could not fit into memory yet access had to be extremely fast. So, for years I used plain CSV files to store the data.
Basically, reading the CSV files from CSI and writing out new price CSV files with my fixes from possible bad data along with additional calculated fields. At first I stored the data into 1 big CSV file. Then used either the DOS sort or Bash sort command to sort the file by date. I was afraid I would run into file size limits (at the time I was on Windows XP 32-bit). So, I started writing the data out to thousands of files broken down by date. Basically, each file was a date containing all the prices for that date. Worked really well...except analysis on the backend became difficult. Plus, it felt kludgy.
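The one-file-per-date layout described above can be sketched roughly as follows. This is a minimal illustration, not the original code: the function name `split_by_date`, the `date` column name, and ISO-formatted dates (so they are safe to use as file names) are all assumptions. Rows are streamed one at a time so the big source file never has to fit in memory.

```python
import csv
import os


def split_by_date(src_path, out_dir, date_field="date"):
    """Stream one big price CSV into one CSV file per trading date.

    Hypothetical helper: assumes a header row and a date column whose
    values (e.g. 2010-02-10) are usable as file names.
    """
    os.makedirs(out_dir, exist_ok=True)
    seen = set()  # dates whose output file has already been started
    with open(src_path, newline="") as src:
        reader = csv.DictReader(src)
        for row in reader:
            date = row[date_field]
            path = os.path.join(out_dir, date + ".csv")
            is_new = date not in seen
            # Start a fresh file the first time a date appears, append after that.
            with open(path, "w" if is_new else "a", newline="") as out:
                writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                if is_new:
                    writer.writeheader()
                writer.writerow(row)
            seen.add(date)
    return sorted(seen)
```

Reopening the output file for every row keeps the sketch simple; a real run over thousands of dates would probably want to cache open file handles.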
I had always tried to use regular databases for the pricing backend...but they couldn't handle the storage and retrieval rates I required. Just too slow. And yes, I tried all of them: MySQL, PostgreSQL, Firebird, Berkeley DB, SQLite, etc.
It wasn't until I read an article by Bret Taylor covering how FriendFeed uses MySQL that I had an idea as to how to use a database to get the best of both worlds - fast storage & retrieval along with slick and easy access to the data. That's when I went back to SQLite and began a massive hacking of code while on a Texas Hill Country vacation. Really bumped the trading simulation engine to another level. The trick to fast storage & retrieval? Use fewer but bigger rows.
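One way to read "fewer but bigger rows" is sketched below. The post doesn't say how the rows were actually laid out, so this is an assumption: one row per symbol, with the whole price series pickled and compressed into a single BLOB. The table name `series` and the helpers `save_series` and `load_series` are hypothetical.

```python
import pickle
import sqlite3
import zlib


def save_series(conn, symbol, bars):
    """Store an entire price series as one compressed BLOB row (assumed layout)."""
    blob = zlib.compress(pickle.dumps(bars, protocol=pickle.HIGHEST_PROTOCOL))
    conn.execute(
        "INSERT OR REPLACE INTO series (symbol, bars) VALUES (?, ?)",
        (symbol, blob),
    )


def load_series(conn, symbol):
    """Fetch and unpack the series for one symbol, or None if it is missing."""
    row = conn.execute(
        "SELECT bars FROM series WHERE symbol = ?", (symbol,)
    ).fetchone()
    return None if row is None else pickle.loads(zlib.decompress(row[0]))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE series (symbol TEXT PRIMARY KEY, bars BLOB)")

# Illustrative data only: (date, open, high, low, close, volume)
ibm_bars = [
    ("2010-02-10", 122.9, 123.5, 122.2, 122.8, 5000000),
    ("2010-02-11", 122.6, 124.2, 122.1, 123.7, 5100000),
]
save_series(conn, "IBM", ibm_bars)
conn.commit()
```

A whole symbol comes back with a single indexed lookup instead of thousands of small row fetches. The cost is that SQL can no longer filter inside the blob, so any per-date querying has to happen in Python after loading.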
For a memory database? I use numpy. It's a fantastic in-memory multi-dimensional storage tool. I dump the price series from SQLite to numpy to enable row or column-wise retrieval. Only recently have I found the performance hit is a little too much. So, I've removed numpy from one side of the framework. And contemplating removing it from the other side as well. It takes more work to replicate numpy via a dictionary of dictionaries of lists. But, surprisingly, it is worth the effort when dealing with price series. Which means, I may not use numpy in the engine for long. Still a great tool to use for in-memory storage.
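As a rough picture of the "dictionary of dictionaries of lists" idea, the sketch below keys prices by symbol, then by date, with a plain list of fields as the value. The exact nesting used in the engine isn't stated in the post, so this layout and the helper names `build_store`, `row_for`, and `column_for` are assumptions.

```python
# Hypothetical field positions inside each per-date list.
OPEN, HIGH, LOW, CLOSE = range(4)


def build_store(rows):
    """Build {symbol: {date: [open, high, low, close]}} from flat tuples."""
    store = {}
    for symbol, date, o, h, l, c in rows:
        store.setdefault(symbol, {})[date] = [o, h, l, c]
    return store


def row_for(store, symbol, date):
    """Row-wise access: all fields for one symbol on one date."""
    return store[symbol][date]


def column_for(store, symbol, field):
    """Column-wise access: one field across all dates, in date order."""
    by_date = store[symbol]
    return [by_date[d][field] for d in sorted(by_date)]


# Illustrative data only.
store = build_store([
    ("IBM", "2010-02-11", 122.6, 124.2, 122.1, 123.7),
    ("IBM", "2010-02-10", 122.9, 123.5, 122.2, 122.8),
    ("GE", "2010-02-10", 15.7, 15.9, 15.5, 15.8),
])
```

Row lookups are two dictionary hits; the column-wise path is where the extra work shows up, since it has to sort the dates and walk them on every call.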
Editor, Scheduler, and Shells
SciTE is hands down my favorite Python editor. I don't like the fancy IDE type stuff. SciTE keeps it simple.
Windows Task Scheduler is for the birds. I should know...my main job is centered around Enterprise Scheduling. But the Windows Task Scheduler gets the job done most of the time. I just have to code around the times it misses or doesn't get things quite right. Which is okay...that's life. That's one of the main reasons I have thought about switching to a *nix box for cron and the like.
The DOS shell or Bash shell...I don't get too fancy in either. I do use the Bash shell quite a bit for making global changes to the Python code. Or I did back when the database was CSV based. Again, *nix boxes win here. But we Windows developers can hopefully always get a copy of Cygwin to save the day.
Historical Quotes
I have used CSIdata for many years. Mainly for the following reasons:
- Dividend-adjusted quotes which are essential if analyzing long-term trading systems.
- Unadjusted closing price - needed if you wish to test the exclusion of data based on the actual price traded - not the split-adjusted price.
- CSV files - CSI does a great job of building and maintaining CSV files of price history.
- Delisted data - I thought this would be a bigger deal but didn't really impact test results...but still nice to have for confirmation.
- Data is used by several hedge funds and web sites such as Yahoo Finance.
Labels: numpy, programming, python, R, trading
What I'm Researching...
Posted by Mike Taylor | Friday, November 14, 2008
| Posted: 13 Nov 2008 01:00 PM CST great summaries on the classic rexx functions. |
| Posted: 13 Nov 2008 12:53 PM CST Joel on Software's Real World. A must see! |
| Reading List: Fog Creek Software Management Training Program - Joel on Software Posted: 13 Nov 2008 12:50 PM CST great reading list! |
| In Python how do I sort a list of dictionaries by values of the dictionary? - Stack Overflow Posted: 09 Nov 2008 09:29 PM CST |
| AT&T Labs Research - Yoix / YWAIT Posted: 07 Nov 2008 07:36 AM CST Interesting way to build a web application. Wonder how complex this would be to use versus traditional web-based systems (LAMP)? This may be easier to deploy if the goal of the software is simulation/visualizations. Something to toy with. |
| AT&T Labs Research - Yoix / Byzgraf Posted: 07 Nov 2008 07:33 AM CST Another great looking toolset using Yoix that enables plotting functions: line, bar, histograms, etc. |
| AT&T Labs Research - Yoix / YDAT Posted: 07 Nov 2008 07:32 AM CST Extremely cool visualization toolset from AT&T Labs Research. Handles graphviz files. |
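For the Stack Overflow question linked above (sorting a list of dictionaries by one of their values), a common approach is `sorted()` with `operator.itemgetter` as the key. The sample records and field names below are made up for illustration.

```python
from operator import itemgetter

# Illustrative records only.
positions = [
    {"symbol": "IBM", "close": 123.7},
    {"symbol": "GE", "close": 15.8},
    {"symbol": "MSFT", "close": 28.1},
]

# Ascending by closing price; sorted() returns a new list.
by_close = sorted(positions, key=itemgetter("close"))

# Descending by symbol.
by_symbol_desc = sorted(positions, key=itemgetter("symbol"), reverse=True)
```

`positions.sort(key=...)` does the same thing in place when a copy isn't needed.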
Labels: graphviz, programming, python, rexx
What I'm Researching...
Posted by Mike Taylor | Friday, November 07, 2008
| Overview of RAMFS and TMPFS on Linux Posted: 06 Nov 2008 11:02 PM CST Map your memory as a drive? Wonder how this would work if you built a linux server with 32gb memory and mapped at least half that dedicated for simulations? How much faster would this be versus traditional disk-based sims? |
| Replacing multiple occurrences in nested arrays - Stack Overflow Posted: 06 Nov 2008 10:58 PM CST will this work in updating a dictionary of prices? if you have a dictionary of portfolio positions with values being python lists...would this be a good solution in updating the closing price of the stock (one of the items in the list)? |
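On the question in the last note above (updating the closing price inside a dictionary of positions whose values are lists), a plain in-place assignment by index may be all that is needed. This is not the linked answer's technique, just a simpler sketch; the list layout `[shares, entry_price, close]` and the name `update_closes` are assumptions.

```python
# Hypothetical position layout: [shares, entry_price, close]
SHARES, ENTRY, CLOSE = range(3)


def update_closes(positions, closes):
    """Overwrite the close slot for every held symbol that has a new price."""
    for symbol, close in closes.items():
        if symbol in positions:
            positions[symbol][CLOSE] = close


# Illustrative data only.
positions = {
    "IBM": [100, 118.0, 118.0],
    "GE": [500, 16.2, 16.2],
}
update_closes(positions, {"IBM": 123.7, "MSFT": 28.1})
```

Lists are mutable, so the dictionary entry changes without rebuilding anything; symbols that aren't held are simply skipped.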
Labels: hardware, linux, python
What I'm Researching...
Posted by Mike Taylor | Monday, October 20, 2008
| Posted: 20 Oct 2008 12:17 AM CDT extremely cool application dock for windows. |
| Python Programming/Lists - Wikibooks, collection of open-content textbooks Posted: 20 Oct 2008 12:12 AM CDT Great collection of python list examples. |
| Introduction To New-Style Classes in Python Posted: 19 Oct 2008 01:18 AM CDT great explanation of python classes. check out the final part discussing the __slots__ feature. basically, reserve attributes...those not defined cannot be assigned. |
| Posted: 18 Oct 2008 12:30 PM CDT html version of the pytables userguide. |
| rdoc:graphics:barplot [R Wiki] Posted: 17 Oct 2008 04:22 PM CDT R doc for barplot |
| Welcome to DrQueue Commercial Website Posted: 12 Oct 2008 11:44 PM CDT queue manager with python binding. looks to be used as a render manager...but could see other uses as well. |
| Building home linux render cluster Posted: 12 Oct 2008 11:30 PM CDT excellent article on building a cheap 24 core x 48GB ram linux cluster. |
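The `__slots__` behavior mentioned in the new-style classes note above can be seen in a few lines. The `Bar` class and its fields are made up for illustration.

```python
class Bar:
    """One price bar; __slots__ reserves exactly these attributes."""

    __slots__ = ("date", "close")

    def __init__(self, date, close):
        self.date = date
        self.close = close


bar = Bar("2008-10-17", 25.4)
bar.close = 25.9  # fine: 'close' is a declared slot

try:
    bar.volume = 1000  # not declared in __slots__
    assigned = True
except AttributeError:
    assigned = False  # instances have no __dict__ to hold new names
```

Besides catching typos in attribute names, slotted instances skip the per-instance `__dict__`, which can matter when holding a very large number of small objects.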
Labels: cluster, links, pytables, python, R, tools
What I'm Researching...
Posted by Mike Taylor | Wednesday, October 08, 2008
| Linus' blog: .. so I got one of the new Intel SSD's Posted: 07 Oct 2008 10:02 PM CDT great analysis on evaluating SSD hard drives. read the comments for more info. as an aside...linus has a blog...cool. |
| Posted: 07 Oct 2008 12:45 PM CDT monte carlo in python? looks worth exploring further. |
Labels: hardware, links, python
What I'm Researching...
Posted by Mike Taylor | Tuesday, October 07, 2008
| The Sect of Homokaasu - The Rasterbator Posted: 07 Oct 2008 01:45 AM CDT Cool, print huge posters from normal paper - software breaks up images to fit on 8.5 x 11 paper. Hat-tip to my wife for finding this site. |
| Posted: 06 Oct 2008 12:43 PM CDT Great site covering formulas of investment stats. Useful for coding the performance part of the testing platform. |
| pickle(cPickle) vs numpy tofile/fromfile - Python - Snipplr Posted: 05 Oct 2008 11:09 PM CDT interesting code snippet comparing performance of cpickle and numpy to/from file routines. been thinking about this lately...using numpy directly or cpickle instead of using a bloated dbms for persistent storage of time series on the testing platform. |
| HintsForSQLUsers - Hierarchical Datasets in Python Posted: 05 Oct 2008 11:06 PM CDT covers many of the faq of SQL developers when developing with PyTables. |
| EasyvizDocumentation - scitools - Google Code - Easyviz Documentation Posted: 05 Oct 2008 09:55 PM CDT Python plotting interface to various backend plotting engines: Gnuplot, Matplotlib, Grace, Veusz, PyX, VTK, VisIt, OpenDX, and a few more. Seems like a fairly straight-forward interface. And choosing the backend used is a one-line import statement. Interesting. |
| Posted: 05 Oct 2008 12:25 PM CDT looks like a dead-simple plotting library in python to produce pub quality pdf/ps images. Need to explore. |
Labels: investing, links, plotting, pytables, python
What I'm Researching...
Posted by Mike Taylor | Sunday, October 05, 2008
| Posted: 05 Oct 2008 12:12 AM CDT JavaScript WYSIWYG editor - haven't tried it...but may be worth testing on a new project of mine. |
| PyTables - Hierarchical Datasets in Python Posted: 04 Oct 2008 01:35 PM CDT the original python interface to the HDF5 library. Have tested this before...need to test again using new architecture. Original tests found speeds that were equivalent to SQLite but of course slower than CSV files. |
| Python bindings for the HDF5 library — h5py v0.3.1 documentation Posted: 04 Oct 2008 01:33 PM CDT a python interface to the excellent HDF5 library. worth testing in project. |
| Posted: 04 Oct 2008 12:24 PM CDT enjoyed reading this guy's take on Erlang. Of course, he had me with quoting Unix philosophy, "Do one thing and do it well." |
| Optimal RAID setup for SQL server - Stack Overflow Posted: 04 Oct 2008 10:35 AM CDT Excellent Q&A on choosing the optimal RAID config for disk i/o performance. By the by, stackoverflow is an awesome site for programmers!!! |
Labels: erlang, HDF5, javascript, links, python, raid, WYSIWYG
Recent Links for 09/21/2007
Posted by Mike Taylor | Friday, September 21, 2007
| Newbie - converting csv files to arrays in NumPy Great message thread on how to convert csv files to numpy arrays. |
| Cookbook/InputOutput - Numpy and Scipy File processing examples using numpy, scipy, and matplotlib. How to read/write a numpy array from/to ascii/binary files. |
| Numpy Example List Examples of Numpy functions such as fromfile(), hsplit(), recarray(), shuffle(), sort(), split(), sqrt(), std(), tofile(), unique(), var(), vsplit(), where(), zeros(), empty(), and many more. |
| Introducing Plists: An Erlang module for doing list operations in parallel Could you spawn a trading system process for each stock of a given day's trading (a list)? What if you had 20,000 stocks for a given day? Can plists/erlang handle 20,000 processes without hitting memory constraints? |
Labels: erlang, links, numpy, python
Recent Links for 09/18/2007
Posted by Mike Taylor | Tuesday, September 18, 2007
| Chapter 22. Struct and Array Modules Overview of the python struct and array modules |
| Building Skills in Programming Nice python tutorial. |
| Python Grimoire Nice python cookbook. |
Labels: python
Recent Links for 09/17/2007
Posted by Mike Taylor | Monday, September 17, 2007
Labels: investing, links, numpy, python
Recent Links for 09/15/2007
Posted by Mike Taylor | Sunday, September 16, 2007
Links for 2007-09-15 [del.icio.us] Posted: 16 Sep 2007 12:00 AM CDT
- Practical Common Lisp
Excellent way to get started with Common Lisp.
- 9 Things You Simply Must Do
Friend of mine sent me this great post on Dr. Cloud's 9 principles commonly practiced by successful people. My favorites?
  - Principle #2: Pull the Tooth - face your fears...don't put off until tomorrow what you can do today.
  - Principle #4: Do Something
- ONLamp.com -- An Introduction to Erlang
Great coverage of the Erlang language.
- Python Cheat Sheet
Simple little Python cheat sheet.
Labels: erlang, links, lisp, python, success
Recent Links 09/05/2007
Posted by Mike Taylor | Wednesday, September 05, 2007
Speed up R, Python, and MATLAB - Going Parallel
Labels: programming, python, rlanguage
Recent Links 09/04/2007
Posted by Mike Taylor | Tuesday, September 04, 2007
World Beta - Engineering Targeted Returns and Risk: More On The Endowment Style Of Investing
- World Beta shares some links covering the endowment investing side of things...
- A link to Frontier Capital Management - check out their knowledge section for more great papers similar to the ones Faber links to.
- Faber mentions a great upcoming book covering the twelve top endowment CIOs.
- from Alpha Magazine...Highbridge Capital Management shares its office organization - putting traders and developers together. I've always thought this would be a great idea in any shop. By putting users and developers together - manual tasks can be seen and automation can happen.
- A link to
- Great little file compare utility. Graphic front end to the diff program.
note: tested this today against a large file/program (well, not that large in my line of work...but I guess to Google's)...couldn't handle it. But, works great on small files.
- post by taylortree
Google Mondrian: web-based code review and storage
- Online code review that works like a blog/wiki. I wonder...is it possible to create a code review system similar to Mondrian within a source management toolset such as subversion? Seems like most of the backend is there already...would only need to add some front end tools to display the changes being committed and allow comments on those changes.
Recent Links 09/03/2007
Posted by Mike Taylor | Monday, September 03, 2007
ONLamp.com -- Numerical Python Basics
- Numpy basics.
Finding Duplicate Elements in an Array :: Phil! Gregory
- Interesting way to find duplicates in an array. Enjoyed the links on the pigeonhole principle and Floyd's cycle-finding algorithm.
integers less than n. We can be sure (by the pigeonhole principle) that there is at least one duplicate.
use Floyd's cycle-finding algorithm. It works roughly like this: Start at the beginning of the sequence. Keep track of two values (call them a_i and a_j). At each step of the algorithm, move a_i one step along the sequence, but move a_j two steps. Stop when a_i = a_j.
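The excerpt above only covers the first phase (the two values meeting somewhere inside the cycle). A minimal sketch of the whole idea follows, under a stated assumption about the setup: a list of n + 1 integers, each between 1 and n, where each value is treated as the index of the next element and the walk starts at index 0. The function name `find_duplicate` is hypothetical.

```python
def find_duplicate(nums):
    """Return a repeated value using Floyd's cycle-finding algorithm.

    Assumes len(nums) == n + 1 and every value is in 1..n, so following
    i -> nums[i] from index 0 must eventually enter a cycle.
    """
    # Phase 1: move one pointer a single step and the other two steps
    # until they meet somewhere inside the cycle.
    slow = nums[0]
    fast = nums[nums[0]]
    while slow != fast:
        slow = nums[slow]
        fast = nums[nums[fast]]

    # Phase 2: restart one pointer from the beginning; stepping both one
    # at a time, they meet at the cycle's entry, which is the duplicate.
    slow = 0
    while slow != fast:
        slow = nums[slow]
        fast = nums[fast]
    return slow
```

For example, `find_duplicate([1, 3, 4, 2, 2])` returns `2`. The list is never modified and only two integers of extra storage are used.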
Labels: programming, python
Reduce runtime of Python, R, and MATLAB applications by 85%? Process 10-100X larger datasets? With just a few code changes? Not quite sure how...but something to explore in the future. Their success story on speeding up MATLAB code for Monte Carlo analysis looks like a pretty easy code change to me. Read their blog for further insights into HPC...
- Multicore: Why all the Hubbub?
- What is "Productivity" in High Performance Computing?