I had in mind starting a forum for those of us who wish to code from scratch, for the purpose of creating our own custom trading tools: portfolio-capable backtest engines, forward (real-time) trade management, simulators (Monte Carlo and such), etc.

Currently, I am learning VB.net, and I am working on my Database design. I am using Visual Studio.net, VB.net, and SQL server 2k.


I wanted to elicit an exchange of information between traders who are programming, or learning to program.

I picked VB.net and SQL Server as my tools to start with.

I could use help right now designing my DB. I have a preliminary design, but I am basically a beginning programmer and do not have much experience to draw upon.

I need to figure out how to manage market data, perform backtests, etc. My plan is not to bother with a scripting language (such as TradeStation's or Trading Recipes'), since I will have access to the source code. Therefore, I will have NO limitations on the ideas I want to test: portfolio-level, single-market, position sizing algorithms, etc.

I have attached a screen shot of my spreadsheet that is a prototype DB design. Please understand that I am not using this particular spreadsheet; it's just a "drawing board" for my DB design.

Here is what I have spec'ed out so far; unfortunately, some of it is proprietary, so I have deleted that material:

Requirements:
The application should import data from the CSI database into the application's associated SQL Server DB
The application should allow historical backtesting of various systematic trading strategies over a portfolio, and display the results. Parameters may be changed to check for optimization, anti-optimization, etc. Results of backtests should be displayed in graphical format and saved.

The application (or a separate but related application) should allow for trading the same strategy in real time, going forward. Allowances for corrections, withdrawals, and additions of portfolio equity will be necessary. After every bar interval (initially, this will be daily), a report will be generated showing all orders that need to be sent to the broker. Possibly an interface will be built that automates sending orders to the broker. In addition, a panic stop should be incorporated that would generate the necessary orders for exiting all trades. Thus, a restart feature will be necessary as well, to resume trading from the signals that would have been generated if the system had not been paused.

Future Support for Intra-Day Order Execution:
To support intra-day order execution, the application will need to interface with brokers' order entry systems to enable automatic execution of orders. In addition, it will need to generate order execution summary sheets (which may be faxed or e-mailed to the broker).

Initially, the application will perform backtests of trading strategies, such as trade-all-markets-in-portfolio with one system, multiple systems, and multiple copies of the same-systems with different time frames. Various statistics will be generated and stored after each test.

TimePeriod = (EndDateTime − StartDateTime). A time period is the discrete amount of time that makes up a bar (see below). A time period may be 1 second, 20 seconds, 3 hours, 1 day, and so on. The smallest time period that may be chosen is 1 second. The database defines all time periods in terms of half seconds; a half second was chosen so that a signal can be generated before a time period has passed (a signal may be generated mid-bar).
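As a sketch of the half-second convention described above (the function names here are purely illustrative, not from any spec):

```python
# Sketch: representing TimePeriods in the database's half-second units,
# as described above. Names (to_half_seconds, etc.) are illustrative only.
from datetime import timedelta

def to_half_seconds(period: timedelta) -> int:
    """Convert a bar period to half-second units (the DB's base unit)."""
    return int(period.total_seconds() * 2)

def from_half_seconds(units: int) -> timedelta:
    """Convert half-second units back to a time period."""
    return timedelta(seconds=units / 2)

# The smallest allowed period, 1 second, is 2 units; 1 day is 172,800 units.
assert to_half_seconds(timedelta(seconds=1)) == 2
assert to_half_seconds(timedelta(days=1)) == 172800
```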

Bar — graphical representation of the Open, High, Low, and Close prices for any given TimePeriod. Note that the use of "Open"

I ask again the same question: Why do you need a database in the first place?

Is this because the other tools you use _need_ a database/SQL and all that, or do _you_ have the need for a database?

For example, in what you describe, I don't see a "portfolio" as such, nor anything related to instrument selection. Why do you think you have covered "all the ground" with what you suggest?

It seems to me that you are focusing on one area while forgetting essential elements of your design. It's probably better to go in concentric loops than to head straight for the core while missing half the requirements! ...

Why a database? (did I ask already? OK, then, one more time: WHY A DATABASE?).

Karakoram wrote:I had in mind, starting a forum for those of us that wish to code from scratch, for the purpose of creating our own custom trading tools, such as portfolio able backtest engines, forward (real time) trade management, simulators (monte Carlo and such), etc.

Chris, if you get enough interest, I can create a user group here, make you a moderator of this group, and add a special members-only forum area that you can use.

Now putting my programming and database design hat on, I'll make some suggestions.

First, putting the data in a database is okay from a certain perspective, but you are going to want to run tests out of a more accessible format. There are simply too many rows of data to place them in a database and expect reasonable retrieval rates. I'd put references to the data, or perhaps place the data into BLOB columns, but you'll be sorry if you put each day into its own row.

Also as a general rule, don't put anything in a database that you can put in memory, unless you need that information after the program has quit running.

Third, the scope of what you envision is very large, several years worth of work for an experienced programmer.

Try to break it up into many (at least five or six) stages, each of which should take no more than one or two months work start to finish. It is very good to know where you want to go in the grand scheme when you do a design, but you will find that incremental staging is much easier to achieve, much better on your sanity, and much more likely to result in something that works.

The process of prioritization one is forced to go through when designing the different stages is extremely valuable.

What verec is getting at, is that you should have a set of user requirements that drive your designs. A set of specific capabilities that cause you to make certain design decisions.

An example might be:

Need to be able to search through all previous tests by date, results (i.e. MAR, return, etc.), and system

Need to be able to list all the historical positions for any given day

etc.

These requirements drive the design. Don't put the cart before the horse. If you are going to have discussions with other programmers, especially more senior ones, they will want to know the reasons behind the features and designs so they can evaluate alternative means to satisfy those requirements.

Commercial software design starts with a set of requirements (often produced by the product marketing department) which specify what the applications need to be able to DO but not how. The programmers/designers determine HOW the program will perform the DO. So another way of asking the question is: What does the program need to be able to DO?

Finally, the design of a system at a higher level is largely independent of the implementation tools (Java, VB, C++, etc.; SQL Server via ODBC or JDBC, etc.), and verec has started a similar effort (see viewtopic.php?t=779&highlight=plugyourown), so you would probably benefit from collaborating on design ideas on his project.

On the subject of a Forum-- Forum Mgmnt, thank you for your offer. I had hoped that something would be started here, in the TTS forum. I guess we should see what kind of interest there is.

Programming approach:
Verec, I took a look at your site, as c.f. posted the link to your discussion (I was unaware of it until now). I also took a look at the ExtremeProgramming.org site. I agree that the incremental approach would be the best way to go.
I should also qualify everything I say with this: I have some (old) programming experience, but I am not a programmer. I am learning VB.net from the ground up. Like you, I found the existing trading software packages too limited in capabilities. I was advised by a Market Wizard that custom programming was the way to go. I resisted the idea because of what was involved in learning to program. Finally, I decided that the benefits outweighed the cost, and for the past 6 months I have been learning a real programming language (as compared to Sneazy Language).

Why SQL server:
I have a couple of friends who are programmers full time (but, at this time, are not interested in trading). When I can pin them down, I get some advice on the design of my software. Since I am the only intended end user of this software, I think part of the basic problem here is that many of my requirements are in my head. I can picture what I want the software to do. I will go ahead and get it all down on paper (or in Word).

In my discussions with my programming friends, after telling them about the amount of data and what I wanted to do with it, they advised me to use SQL Server. They told me that data retrieval performance would not be a problem (as it is with Access), since SQL Server can retrieve just the data I am querying for. In addition, it's available to me as a tool, and from what I understand, it's a robust one. I also have several books that cover VB.net with SQL Server. I asked them about retrieving data via an ASCII text file, but they advised against that approach.

It's been so long since I decided to use SQL Server that I had forgotten some of the reasons. I went back and looked at my email discussions with the trader I know. Here is a summary:

Backtests end up too large to fit entirely into memory
Save the results of a run
Backtests run from a DB vs. memory may be slower, but not so slow as to be cumbersome
Also, I need to save some results after each time period; for example, a stock sorting method that, after each day, scans a universe of stocks for criteria, generates a list, then sorts and stores it for that day

Given the advice from Forum Mgmnt, in his last post, I will talk to my programmer friends some more.

Why VB.net:
I pondered the question of "which language?" for a long time. I asked the advice of my programmer friends and of the successful trader I know. In the end, I decided to go with VB.net. However, it's my understanding that when using the .net framework, choosing a language is more like asking, "What's my favorite color?"

One advantage of VB.net over some of the other languages is that VB programmers are more common and more cost-efficient to hire.
Another is that it is easier to debug than C++, less frustrating for a beginner, and easier to maintain with a smaller team.

Extreme Programming Approach:
One of my programming friends taught me about the "incremental programming approach," which, as he described it, I think is the same thing as Extreme Programming.

I will rework my requirements, in the spirit of User-Stories. Once I have these, I will post them here, and perhaps it will make it clearer what I am trying to accomplish. In addition, in the spirit of the incremental programming approach, I will break up the project as c.f. suggested.

I posted my DB design earlier, because I wanted help with designing it to support future features, such as real-time trading, on an intra-day basis. However, my current requirement in this respect is end-of-day (and larger period) data handling.


I have no idea how efficient all the layers between the SQL database and your VB code are but I suspect you will be 10 to 20 times slower with SQL Server. If they do a great job you'll be 2 to 3 times slower. You could easily be 100 times slower or more based on other similar tests I have done in the past.

Disk speeds have not progressed as fast as memory or processor speeds, and you will often find that databases read in small chunks when your own code could easily read an entire market file in one single, fast and efficient disk read.

Disk rotation is measured in milliseconds, typically 4ms to 6ms for faster disks. That's really slow in computer time. Consider that a 3 GHz machine can execute as many as 12,000,000 instructions in 4ms (at one instruction per clock cycle). So database access of a lot of information is generally much, much slower than direct disk access.
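A quick back-of-envelope check of that figure (the instructions-per-cycle number is a simplifying assumption; real CPUs vary in both directions):

```python
# Back-of-envelope: how many instructions a CPU can retire while waiting
# out one rotational delay. IPC of 1 is an assumption, not a measurement.
clock_hz = 3_000_000_000          # 3 GHz clock
latency_s = 0.004                 # 4 ms rotational delay
instructions_per_cycle = 1        # simplifying assumption

cycles = clock_hz * latency_s
instructions = int(cycles * instructions_per_cycle)
print(instructions)               # -> 12000000
```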

With regard to data retrieval speed from a DB, a common approach when deciding what tools are appropriate is to model your actual data and then run some tests.

Take 25 years of data from a TEXT file database and put it into SQL server. Read all the rows, see how long this takes.

Now make 100 copies of that data and do the same to emulate having 100 markets worth of data. Read all the rows, see how long this takes.

Compare this to reading an equivalent amount of binary data. Is the speed difference something you can live with?

Now multiply that all by 80 to emulate the amount of data in 5 minute bars, read all the rows and see how long it takes. Compare that to reading binary format. Compare the space requirements. You are the one who will have to live with the performance, so if it meets your needs, then you are good to go.

This shouldn't take more than a few hours once you have code that can read sample data and put the rows into the database. You probably already have this done.
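A minimal sketch of that experiment, using SQLite in place of SQL Server so it is self-contained, and synthetic data in place of the CSI import; the shape of the comparison is the same:

```python
# Sketch of the timing experiment described above. SQLite stands in for
# SQL Server and the data is synthetic: ~25 years of daily bars (6,300 rows).
import os
import random
import sqlite3
import struct
import tempfile
import time

rows = [(d, random.random(), random.random(), random.random(), random.random())
        for d in range(6300)]

# --- load into a database table, one row per bar, then time a full read ---
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE bars (day INT, open REAL, high REAL, low REAL, close REAL)")
db.executemany("INSERT INTO bars VALUES (?,?,?,?,?)", rows)

t0 = time.perf_counter()
fetched = db.execute("SELECT * FROM bars ORDER BY day").fetchall()
db_time = time.perf_counter() - t0

# --- same data as one flat binary file, read back in a single call ---
rec = struct.Struct("<iffff")
path = os.path.join(tempfile.gettempdir(), "bars.bin")
with open(path, "wb") as f:
    for r in rows:
        f.write(rec.pack(*r))

t0 = time.perf_counter()
with open(path, "rb") as f:
    blob = f.read()                       # one read for the whole "market file"
bars = [rec.unpack_from(blob, i * rec.size) for i in range(len(blob) // rec.size)]
bin_time = time.perf_counter() - t0

print(f"db: {db_time:.4f}s  binary: {bin_time:.4f}s  rows: {len(bars)}")
```

Scaling this to 100 markets (and then 80x again for 5-minute bars) is just a loop around the same two measurements.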

Karakoram wrote:Why SQL server:
I have a couple of friends that are programmers full time

There's something that is obvious to you but not necessarily obvious to them: your CSI-acquired data can be considered read-only, because once you have imported it, in whatever format is appropriate, you will never modify it again. (Let's ignore the issue of CSI once in a while issuing a correction to some price; that can largely be dealt with out-of-band.)

This matters because, SQL or not, most databases are optimized to make UPDATEs to their data set reasonably fast. Since you're never going to update the data anyway, the trade-off can be seen in another light, and optimizing for read access time will probably drag you out of the database realm entirely, precisely because you want fast access.

Also, another argument against using a database or SQL is that nowhere in your requirements, so far, have I seen that you would want to run queries like "give me the top 10 instruments whose close price is between 123 and 456". Such a query would make little sense, and for good reason.

...So if you need neither a query language nor a database, what about a plain old flat file? It's reasonably efficient, doesn't waste storage, and you can always split your data across many files if they become too big ...

Sure, you can go the schema route and imagine all sorts of intellectually interesting things, but you would probably be over designing.

Karakoram wrote:At this point, I would like to encourage an exchange of classes and functions.

That's an excellent attitude! Unfortunately, as for myself, I don't grok either VB or .net. I'm sure they are very fine and powerful languages and environments, but I don't have the time to invest in Windows-only technologies. That was one of my reasons for choosing Java in the first place: it runs on Windows, Mac, Linux ...

I wish you the best of luck and hope you'll find other people talking VB the way you like.

Having dealt with all this for years, raw binary data files rule when it comes to speed.

Reading text files can be reasonably fast, but all that field splitting, qualifying, and parsing to native binary formats does take processing time. (It's mostly negligible until you start reading heaps of data in.) Of course, text has its advantages in that it is human-readable and readily editable in any old text editor, which can be darn handy at times.

Sorted binary files with fixed-length records can be binary-searched to rapidly find quote records with specific dates (or better still use an interpolation search for even greater speed). This really speeds up testing when you only want data starting from say January 1995 in a file that has data that goes back to 1979. This one has been a significant time saver when I do out-of-scope testing.
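A sketch of that technique; the record layout here (date as a YYYYMMDD integer plus OHLC floats) is illustrative only, not the MetaStock format:

```python
# Sketch: binary search over a file of sorted, fixed-length quote records to
# find the first bar on or after a given date, without reading the whole file.
import os
import struct
import tempfile

REC = struct.Struct("<iffff")   # date (YYYYMMDD int), open, high, low, close

def find_first_on_or_after(path, target_date):
    """Return the first record whose date >= target_date, or None."""
    n = os.path.getsize(path) // REC.size
    lo, hi = 0, n
    with open(path, "rb") as f:
        while lo < hi:                        # binary search on record index
            mid = (lo + hi) // 2
            f.seek(mid * REC.size)            # jump straight to record `mid`
            date = REC.unpack(f.read(REC.size))[0]
            if date < target_date:
                lo = mid + 1
            else:
                hi = mid
        if lo == n:
            return None                       # every record is before target_date
        f.seek(lo * REC.size)
        return REC.unpack(f.read(REC.size))

# Build a tiny sample file of date-sorted records.
path = os.path.join(tempfile.gettempdir(), "quotes.bin")
with open(path, "wb") as f:
    for d in (19790103, 19850607, 19950102, 19990430):
        f.write(REC.pack(d, 1.0, 2.0, 0.5, 1.5))

print(find_first_on_or_after(path, 19950101)[0])   # -> 19950102
```

Only O(log n) records are ever read, which is why this pays off so well on files going back decades.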

This brings up one thing that RDBMSs can have an advantage at. By indexing on the quote date field, you can rapidly jump to specific quotes by date for this sort of out-of-scope testing. Also, all that querying that an RDBMS can do does have benefit, but I've mostly only found this so when I was dealing with more complex historical options data (lots of contracts, strike points, scaling due to splits, etc.)

BTW, the MetaStock/Computrac format is binary, but the floats are stored in the old Microsoft Basic format (predating the IEEE floating point standard), so you still have to do some bit manipulation. Then there's the need to crunch the integer-formatted dates, and so on. No doubt verec's kindly donated code has all the coding details.

Hey verec, all my code is in .NET, so I'd love to start one of those pointless religious language wars with you. Naaaah, just kidding.

Loitering Without Intent wrote:This brings up one thing that RDBMSs can have an advantage at. By indexing on the quote date field, you can rapidly jump to specific quotes by date


Certainly, but it can be much faster if you only consider your quotes as being an ordered set, with elements 0 through n. If the ordering is preserved, as is the case if you use a simple array, then the only thing you need is a map (some other array with same length) of quote index to quote date. If that date map contains a binary representation of the date, then a binary search will give you a very fast access to that particular quote index, and from that index, instantaneously to the quote itself.

I would venture that an optimised binary search of such a date map will outperform any RDBMS btree search on some date index.
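A sketch of that date-map idea, assuming the quotes have already been loaded into ordered in-memory arrays:

```python
# Sketch: quotes as an ordered in-memory array, plus a parallel "date map"
# array of the same length. bisect on the date map yields the quote index
# directly; the quote itself is then an O(1) array lookup.
from bisect import bisect_left

dates  = [19790103, 19850607, 19950102, 19990430]   # same ordering as quotes
quotes = [(1.0, 2.0, 0.5, 1.5),                     # placeholder OHLC tuples
          (1.1, 2.1, 0.6, 1.6),
          (1.2, 2.2, 0.7, 1.7),
          (1.3, 2.3, 0.8, 1.8)]

def quote_on_or_after(target_date):
    """First quote dated on or after target_date, or None."""
    i = bisect_left(dates, target_date)             # binary search of the date map
    return None if i == len(dates) else quotes[i]

print(quote_on_or_after(19950101))   # -> (1.2, 2.2, 0.7, 1.7)
```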

Loitering Without Intent wrote:Hey verec, all my code is in .NET, so I'd love to start one of those pointless religous language wars with you.

The bulk of the time to read data off the disk is spent waiting for the disk's read head to get to the right spot on the disk.

There are two components to this wait:

1) Moving the head to the right track which is known as seek time
2) Waiting for the disk to spin until the head is over the specific spot on the disk where the file begins

For today's large disks these times are huge as compared with the time to read even a relatively large file once the disk is positioned correctly. For this reason, I suggest reading the entire file at one time.

If you read the file in chunks you will need to wait for a bit more than a full disk rotation until you can read the next chunk.

The disk controller may have read enough of the disk ahead to serve your file from its cache if it uses a good read-ahead caching algorithm, but if it doesn't, you will end up slowing your reading down quite a bit by reading in smaller chunks. Some caching algorithms only read a certain amount ahead. So if you read 32K off the disk, the read-ahead algorithm might only read another 32K ahead. If the file is 128K, you will have to wait another rotation to get the third 32K chunk. This will effectively double the time to read the file versus an approach that reads the entire file in one read.
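The two read patterns above can be sketched as follows (the timing difference only shows up on a real, uncached disk; this just illustrates the access pattern):

```python
# Sketch contrasting the two read strategies discussed above: one large read
# versus many 32K chunked reads. On a real, uncached disk the chunked version
# risks paying rotational latency per chunk.
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "market.dat")
with open(path, "wb") as f:
    f.write(os.urandom(128 * 1024))      # a 128K "market file"

# Strategy 1: a single call fetches the whole file at once.
with open(path, "rb") as f:
    whole = f.read()

# Strategy 2: four separate 32K reads, each a separate trip to disk/cache.
chunks = []
with open(path, "rb") as f:
    while True:
        chunk = f.read(32 * 1024)
        if not chunk:
            break
        chunks.append(chunk)

assert b"".join(chunks) == whole and len(chunks) == 4
```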

Remember most of the examples of code from Computer Science were written at a time when the disk versus memory equation was very different than it is now. It wasn't always possible or practical to read entire files into memory. With today's cheap memory (256M machines are very common and 1Gig machines are within reach of most people), it pays to rethink the old algorithms.

All these are excellent points. After collecting up your requirements, and starting your design, you might want to consider commercial or open source libraries to implement some of those requirements. It sure helps whittle down the work. One library that comes to mind is SmartQuant (http://www.smartquant.com). Maybe you'll use it, maybe not. At least you can see an example of how things are put together in a .NET trading package.