Showing posts with label numpy. Show all posts

Thursday, February 11, 2010

Trading Framework Part I: Tools I Use

I received a question from a reader regarding the software I use...more specifically...the open source software I use in trading. Instead of a direct response, I figured this type of question might be useful to other readers of this blog.

My basic trading framework is the following:
Operating System: Windows Vista Home Premium
Programming Languages: Python 2.6.2 & R 2.9.1
Databases: SQLite 2.4.1, Numpy 1.3.0, & CSV
Programming Editor: SciTE 1.78
Graphing Engines: Matplotlib 0.98.5 & R
GUI: HTML & JavaScript
Scheduler: Windows Task Scheduler (DOS) & Cygwin (Bash)
Historical Quotes: CSI & Yahoo Finance

Operating System
Choosing Windows as the operating system is mainly out of convenience. As you can see above, the only real item that would prevent a full move to Linux is the historical quote provider, CSI. Everything else can run on another platform or a suitable alternative is available.

Another reason I've stayed with Windows is my current job (a Windows shop). But, I will admit, I have been very close to switching to a Mac the past few months or possibly OpenSUSE. Just haven't taken the bite yet.

On a side note, prior to my current employer...I worked for a University that was really ahead of its time. Every program we developed had to pass a compatibility test, "Could it easily run on another platform?" While this at times was an impossible task due to user requirements...we still always coded with this compatibility in mind. And I've kept this same philosophy in developing the trading simulation engine.

Programming Languages
I'm originally a Cobol programmer. Yes, that's right...if you've never heard of one, you're reading a blog by one. Cobol programmers, the good ones, are very keen on whitespace. When you're throwing a lot of code around...the whitespace is what keeps you sane. And so, when I was trying out the various scripting languages back in the day...Python really struck my fancy. I spent the better part of 9 years trying to force programmers to keep the code pretty in Cobol. Only to see Python come around and truly force programmers to code clean. Over the years, I have worked in various other languages, but I've always stuck with Python.

I think another reason I chose Python was due to WealthLab's Scripting language (Pascal-based). I felt I could build an environment similar to WealthLab that would offer the same scripting ease. So far, Python has done a great job in keeping the framework simple and extensible.

Another language I have used from time to time in my trading is R. I use R mainly to analyze trading results. A few years ago, I actually developed a prototype of the trading simulation engine in R. But, it was too slow. The loops killed it. With the recent development of Revolution Computing's ParallelR...I've often wondered what the results would now be. But, I'm past the point of no return with the engine in Python. Still, as far as fast analysis of CSV files goes, it is really hard to beat R.

Databases
I struggled several years with how to store and retrieve the historical price series data for the trading simulation engine. The main problem was the data could not fit into memory yet access had to be extremely fast. So, for years I used plain CSV files to store the data.

Basically, I read the CSV files from CSI and wrote out new price CSV files with my fixes for possible bad data along with additional calculated fields. At first I stored the data in 1 big CSV file. Then I used either the DOS sort or Bash sort command to sort the file by date. I was afraid I would run into file size limits (at the time I was on Windows XP 32-bit). So, I started writing the data out to thousands of files broken down by date. Basically, each file was a date containing all the prices for that date. Worked really well...except analysis on the backend became difficult. Plus, it felt kludgy.
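A minimal sketch of the one-file-per-date split described above. It assumes the big CSV is already sorted by date and has a header row with a `date` column whose values are safe to use as file names. The function name, column name, and file layout are hypothetical, for illustration only, and not the engine's actual code.

```python
import csv
import os
from itertools import groupby
from operator import itemgetter


def split_by_date(source_path, out_dir, date_field="date"):
    """Split a date-sorted CSV into one file per date (hypothetical layout).

    Rows are streamed group by group, so the whole file never has to fit
    in memory. The input MUST already be sorted by date_field.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(source_path, newline="") as src:
        reader = csv.DictReader(src)
        # groupby only merges adjacent rows, hence the sorted-input assumption.
        for day, rows in groupby(reader, key=itemgetter(date_field)):
            path = os.path.join(out_dir, day + ".csv")
            with open(path, "w", newline="") as out:
                writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                writer.writeheader()
                writer.writerows(rows)
            written.append(day)
    return written
```

Calling `split_by_date("all_prices.csv", "by_date")` would leave files such as `by_date/2010-02-11.csv`, each holding every symbol's row for that date.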

I had always tried to use regular databases for the pricing backend...but they couldn't handle the storage and retrieval rates I required. Just too slow. And yes, I tried all of them: MySQL, PostgreSQL, Firebird, Berkeley DB, SQLite, etc.

It wasn't until I read an article by Bret Taylor covering how FriendFeed uses MySQL that I had an idea as to how to use a database to get the best of both worlds - fast storage & retrieval along with slick and easy access to the data. That's when I went back to SQLite and began a massive hacking of code while on a Texas Hill Country vacation. Really bumped the trading simulation engine to another level. The trick to fast storage & retrieval? Use fewer but bigger rows.
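One way to picture "fewer but bigger rows" is to store a symbol's entire price history as a single compressed blob instead of one row per bar. The sketch below is only an assumption about what such a layout could look like. The table name, column names, and the JSON-plus-zlib encoding are hypothetical and not necessarily what the engine does.

```python
import json
import sqlite3
import zlib


def open_store(path=":memory:"):
    """Open (or create) a store with one row per symbol (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS series ("
        "symbol TEXT PRIMARY KEY, bars BLOB NOT NULL)"
    )
    return conn


def save_series(conn, symbol, bars):
    """Write a symbol's full history as one big compressed row."""
    blob = zlib.compress(json.dumps(bars).encode("utf-8"))
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO series (symbol, bars) VALUES (?, ?)",
            (symbol, blob),
        )


def load_series(conn, symbol):
    """Read the whole history back with a single primary-key lookup."""
    row = conn.execute(
        "SELECT bars FROM series WHERE symbol = ?", (symbol,)
    ).fetchone()
    if row is None:
        return []
    return json.loads(zlib.decompress(row[0]).decode("utf-8"))
```

The trade-off in a layout like this is that the database can no longer filter on individual bars with SQL. It hands back the whole series and the slicing happens in Python.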

For a memory database? I use numpy. It's a fantastic in-memory multi-dimensional storage tool. I dump the price series from SQLite to numpy to enable row or column-wise retrieval. Only recently have I found the performance hit is a little too much. So, I've removed numpy from one side of the framework. And I'm contemplating removing it from the other side as well. It takes more work to replicate numpy via a dictionary of dictionaries of lists. But, surprisingly, it is worth the effort when dealing with price series. Which means, I may not use numpy in the engine for long. Still a great tool to use for in-memory storage.
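To make the "dictionary of dictionaries of lists" idea concrete, here is a rough sketch in plain Python. The nesting used (symbol, then field name, then a list of values in date order) and the helper names are assumptions for illustration. The post does not spell out the actual layout.

```python
def build_store(records):
    """Build store[symbol][field] -> list of values in date order.

    records: iterable of (symbol, date, values_dict), already in date
    order per symbol. This nesting is a hypothetical layout.
    """
    store = {}
    for symbol, day, values in records:
        fields = store.setdefault(symbol, {"date": []})
        fields["date"].append(day)
        for name, value in values.items():
            fields.setdefault(name, []).append(value)
    return store


def column(store, symbol, field):
    """Column-wise retrieval: one field's whole series for a symbol."""
    return store[symbol][field]


def row(store, symbol, index):
    """Row-wise retrieval: every field for a symbol at one position."""
    return {name: values[index] for name, values in store[symbol].items()}
```

With this layout, `column(store, "AAA", "close")` returns the full closing-price list, while `row(store, "AAA", -1)` returns the latest bar as a small dictionary.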

Graphing Engines and GUI
I really try to keep it simple in the front-end of the trading framework. I use Matplotlib to visualize price or trading results. And HTML along with JavaScript to display trading statistics. Honestly, not a lot has gone into this side of things. Still very raw. My goal for 2010 is to work more in this area.

I have used R quite a bit in analyzing the output of the trading backtests. R is really powerful here. Quickly and easily chart and/or view pretty much any subset of the data you wish.

If there are certain items I look at over and over in the backtests...I'll typically replicate them in Python & Matplotlib and include them in the backtest results.

Editor, Schedulers, and Shells
SciTE is hands down my favorite Python editor. I don't like the fancy IDE type stuff. SciTE keeps it simple.

Windows Task Scheduler is for the birds. I should know...my main job is centered around Enterprise Scheduling. But, the Windows task scheduler gets the job done most of the time. I just have to code around the times it misses a run or doesn't get things quite right. Which is okay...that's life. That's one of the main reasons I have thought about switching to a *nix box for cron and the like.
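One common way to code around a scheduler that occasionally skips a run is to make the job itself catch up: remember the last date it finished, and on every start process each weekday since then. The sketch below shows that general pattern under stated assumptions. The function names are hypothetical, market holidays are ignored, and this is not necessarily how the framework handles it.

```python
from datetime import date, timedelta


def missed_weekdays(last_done, today):
    """Return every weekday after last_done, up to and including today."""
    days = []
    current = last_done + timedelta(days=1)
    while current <= today:
        if current.weekday() < 5:  # Monday=0 ... Friday=4; holidays not handled
            days.append(current)
        current += timedelta(days=1)
    return days


def catch_up(last_done, today, run_for_day):
    """Run the job once per missed weekday and return the new last-done date.

    run_for_day is whatever callable does the real work for one date.
    The caller is expected to persist the returned date somewhere.
    """
    for day in missed_weekdays(last_done, today):
        run_for_day(day)
        last_done = day
    return last_done
```

If the scheduler skipped Monday through Wednesday, Thursday's run simply works through all four dates in order, so a missed trigger costs time but not data.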

The DOS shell or Bash shell...I don't get too fancy in either. I do use the Bash shell quite a bit in performing global changes in the Python code. Or back when the database was CSV based. Again, *nix boxes win here. But, we Windows developers can hopefully always get a copy of Cygwin to save the day.

Historical Quotes
I have used CSIdata for many years. Mainly for the following reasons:
  • Dividend-adjusted quotes which are essential if analyzing long-term trading systems.
  • Unadjusted closing price - needed if you wish to test the exclusion of data based on the actual price traded - not the split-adjusted price.
  • CSV files - CSI does a great job of building and maintaining CSV files of price history.
  • Delisted data - I thought this would be a bigger deal but didn't really impact test results...but still nice to have for confirmation.
  • Data is used by several hedge funds and web sites such as Yahoo Finance.
The only drawback I have to CSI is the daily limit to the number of stocks you can export out of the product. It can get frustrating trying to work around the limit. Of course, you can always pony up for a higher limit.
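The point in the list above about filtering on the actual price traded, rather than the split-adjusted price, can be illustrated with a tiny example. The field names and the sample numbers below are made up for illustration and are not CSI's export format.

```python
# Hypothetical bars for one stock that later split 2-for-1.
# On 2009-06-01 it really traded at 12.00, but the split-adjusted
# series shows 6.00 for that same day.
SAMPLE_BARS = [
    {"date": "2009-06-01", "close_unadjusted": 12.0, "close_adjusted": 6.0},
    {"date": "2010-02-11", "close_unadjusted": 7.0, "close_adjusted": 7.0},
]


def tradable_on(bars, min_price, price_field="close_unadjusted"):
    """Keep only the bars whose chosen price met the minimum-price rule."""
    return [bar for bar in bars if bar[price_field] >= min_price]
```

With a rule like "only trade stocks at 10.00 or above", `tradable_on(SAMPLE_BARS, 10.0)` keeps the 2009 bar, because the stock really did trade at 12.00 that day. Running the same filter with `price_field="close_adjusted"` drops it, which would quietly exclude a trade that was possible at the time.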

This covers Part I of the series. Next up? The type of analysis I perform with the trading framework.

Later Trades,


Friday, September 21, 2007

Recent Links for 09/21/2007

Newbie - converting csv files to arrays in NumPy
Great message thread on how to convert csv files to numpy arrays.
Cookbook/InputOutput - Numpy and Scipy
File processing examples using numpy, scipy, and matplotlib. How to read/write a numpy array from/to ascii/binary files.
Numpy Example List
Examples of Numpy functions such as fromfile(), hsplit(), recarray(), shuffle(), sort(), split(), sqrt(), std(), tofile(), unique(), var(), vsplit(), where(), zeros(), empty(), and many more.
Introducing Plists: An Erlang module for doing list operations in parallel
Could you spawn a trading system process for each stock of a given day's trading (a list)? What if you had 20,000 stocks for a given day? Can plists/erlang handle 20,000 processes without hitting memory constraints?

Monday, September 17, 2007

Recent Links for 09/17/2007