In finance and investing, the term portfolio refers to the collection of assets one owns. Compared to holding a single asset at a time, a portfolio has a number of potential benefits. A universe of asset holdings within the portfolio gives greater access to potentially favorable trends across markets. At the same time, the expected risk and return of the overall holding depend on its specific composition.

A previous post on this blog illustrated a portfolio selection methodology that screens a universe of possible assets. The objective there was to find a particular combination that maximizes the ratio of expected return to risk (Sharpe ratio). The assumption there was an underlying buy-and-hold strategy: once a particular portfolio is selected, its composition is held static until the final position is unwound.

This post is concerned with dynamic portfolio trading strategies where the portfolio is periodically rebalanced. At prescribed intervals the composition of the portfolio is changed by selling existing holdings and buying new assets. Typically it is assumed that reallocation does not change the overall portfolio value, with the possible exception of losses due to transaction costs and fees. A further classification of this type of strategy is whether borrowing is allowed (long-short portfolio) or whether all relative portfolio weights are restricted to be positive (long-only portfolio).

Closely following recent publications by Steven Hoi and Bin Li, we first set the stage by illustrating fundamental portfolio mechanics and then detail their passive aggressive mean reversion strategy. We test drive the model in R using daily NYSE and NASDAQ stock data from 1/2011 to 11/2012.

Portfolio Trading Terminology

Let’s say we want to invest a notional amount $P_0$ into a portfolio and record each day $t$ the total portfolio value $P_t$. The portfolio is composed of $m$ members with a per share dollar price of $S_{i,t}$. Assuming we hold $n_{i,t}$ shares of each component, then the relative portfolio weight for each member is

$b_{i,t} = n_{i,t} S_{i,t} / P_t$.

The dollar delta $n_{i,t} S_{i,t}$ expresses the value of each stock holding. Since the aggregated portfolio value is the sum over its components, $P_t = \sum_i n_{i,t} S_{i,t}$, the relative portfolio weights are always normalized to $\sum_i b_{i,t} = 1$.

Buy and Hold

An instructive example is to look at an idealized, perfectly antithetic pair of artificial price paths. Both assets are assumed to alternate between a gain of 100% and a loss of 50%, so that a static portfolio would not experience any growth after a full round trip is completed. The return column shows the individual relative open-to-close, or close to next day’s close, returns.

Dollar Cost Averaging

Dollar cost averaging is a classic strategy that takes advantage of periodic mean reversion. After each close, the portfolio is rebalanced by distributing equal dollar amounts between the components. After the rebalancing trades, the dollar delta position of each component is the same at the open. Rebalancing itself does not change the total portfolio value. Expressed in terms of the relative portfolio weights, dollar cost averaging is the allocation strategy that resets every weight to the uniform value given by the inverse of the number of components in the portfolio.

Dollar cost averaging: $b_i = 1/m$ for each of the $m$ components.

Perfect Prediction

With the benefit of perfect foresight it is possible to allocate all weight to the single asset with the highest predicted gain.
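The three allocation rules above can be compared on the artificial antithetic pair with a short sketch. It is written in Python rather than the R used in this post, and it assumes that the two assets alternate between doubling (relative return 2.0) and halving (relative return 0.5). The function names are illustrative only.

```python
# Relative returns (close / previous close) for two perfectly antithetic assets.
# Assumption: the toy paths alternate between doubling (2.0) and halving (0.5).
relatives = [(2.0, 0.5), (0.5, 2.0), (2.0, 0.5), (0.5, 2.0)]

def buy_and_hold(relatives, weights):
    # Each initial dollar position compounds on its own; nothing is traded.
    positions = list(weights)
    for x in relatives:
        positions = [p * xi for p, xi in zip(positions, x)]
    return sum(positions)

def rebalanced(relatives, choose_weights):
    # The portfolio value compounds by the weighted relative return b . x.
    value = 1.0
    for x in relatives:
        b = choose_weights(x)
        value *= sum(bi * xi for bi, xi in zip(b, x))
    return value

def equal_weights(x):
    # Dollar cost averaging: uniform weight 1/m in every period.
    return [1.0 / len(x)] * len(x)

def perfect_foresight(x):
    # Peeks at the period's own returns, which no real strategy can do.
    best = x.index(max(x))
    return [1.0 if i == best else 0.0 for i in range(len(x))]

hold_value = buy_and_hold(relatives, [0.5, 0.5])
dca_value = rebalanced(relatives, equal_weights)
oracle_value = rebalanced(relatives, perfect_foresight)
print(hold_value, dca_value, oracle_value)
```

Under these assumptions the static portfolio ends where it started after the two round trips, the uniformly rebalanced portfolio grows by a factor of 1.25 per period, and the foresight portfolio doubles every period.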

Mean Reversion Strategy

The one period portfolio return is given by the product of relative weights and component relative returns as

$P_t / P_{t-1} = \sum_i b_{i,t} x_{i,t} = b_t \cdot x_t$,

where $x_{i,t}$ denotes the relative return of component $i$ over day $t$.

The task of the trading strategy is to determine the rebalancing weights $b_{t+1}$ for the opening position of day $t+1$ given yesterday’s relative returns at the close, $x_t$.
A forecast of a positive trend would be that today’s returns are a repeat of yesterday’s, $x_{t+1} = x_t$. In contrast, mean reversion would assume that today’s returns are the inverse of yesterday’s, $x_{i,t+1} = 1 / x_{i,t}$.

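The following strategy consists of determining the new weights $b_{t+1}$ such that a positive trend assumption, with $x_t$ the vector of yesterday’s relative returns, would lead to a loss, $b_{t+1} \cdot x_t \le \epsilon$, with threshold parameter $\epsilon$.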

Considering returns with

Typically the relative performance vector consists of a mix of growing and retracting assets, so that it has a balanced number of positive and negative components. The loss condition then selects a weight vector that is perpendicular to the performance vector and lies on the unit simplex.

Now assume a mean reverting process that consists of a rotation of performances within the asset vector. Graphically this corresponds to a reflection about the diagonal axis. After the rotation the angle between the weight vector and the new performance vector is reduced. This results in an increase of the portfolio return value.

Thus setting the threshold at the break-even level, for example, generates a flat portfolio in case the performance vector does not change, while it increases the portfolio value in the case of an actual mean reversion.

The strategy implicitly assumes a price process that rotates between outperforming and lagging components within the portfolio. This does not include the situation where all stocks are outperforming or underperforming at the same time. The long-only portfolio strategy relies on the ability to reallocate funds between winners and losers.

For the two share example we have

using and

The table below shows the result of computing the portfolio weights at the open of day 2 according to above formulas.

With more than two assets the loss condition together with the normalization is not sufficient to uniquely determine the portfolio weights. The additional degrees of freedom can be used to impose further criteria one wishes the strategy to observe. A good choice seems to be to require that the new portfolio weights stay close to the previous selection, as this should minimize rebalancing transaction costs.

Minimizing the squared distance between the weights under the normalization and loss constraint leads to the following expression for the new weight

$b_{t+1} = b_t - \tau_t (x_t - \bar{x}_t \mathbf{1})$

as a function of the previous portfolio weights $b_t$, the return vector $x_t$ and the average return across the portfolio components $\bar{x}_t$.

The control gain $\tau_t$ is taken as the ratio of the excess loss over the return variance

$\tau_t = \max(0, b_t \cdot x_t - \epsilon) / \| x_t - \bar{x}_t \mathbf{1} \|^2$.
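The weight update described above can be sketched in a few lines. This is a Python illustration, not the R function block of this post. The name `pamr_update` and the toy inputs are assumptions, and the sketch leaves out transaction costs and any projection of the weights back onto the long-only simplex.

```python
def pamr_update(b, x, eps):
    """One passive aggressive mean reversion step (sketch, no projection)."""
    m = len(x)
    x_bar = sum(x) / m                        # average relative return
    dev = [xi - x_bar for xi in x]            # deviation from the average
    variance = sum(d * d for d in dev)        # squared norm of the deviation
    # Excess loss: how far yesterday's portfolio return exceeds the threshold.
    loss = max(0.0, sum(bi * xi for bi, xi in zip(b, x)) - eps)
    tau = loss / variance if variance > 0 else 0.0
    # Shift weight away from yesterday's winners towards yesterday's losers.
    return [bi - tau * d for bi, d in zip(b, dev)]

b = [0.5, 0.5]          # previous weights
x = [2.0, 0.5]          # yesterday's relative returns (toy values)
b_next = pamr_update(b, x, eps=1.0)
print(b_next)
```

Because the deviations from the average sum to zero, the update keeps the weights normalized. With more extreme inputs individual weights can turn negative, so a long-only implementation would still need a step that maps them back onto the simplex.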

Simulation

This post contains the R code to test the mean reversion strategy. Before running it, the function block in the appendix needs to be initialized. The following code runs the artificial pair of stocks as a demonstration of the fundamental strategy

As a benchmark I always like to test against a portfolio of DJI member stocks, here using recent 2011 history. The strategy does not do very well for this selection of stocks and recent time frame.

As an enhancement, the stocks are screened by ranking them according to their most negative one-day autocorrelation. The 5 best ranked candidates out of the DJI picked this way are “TRV”, “XOM”, “CVX”, “PG” and “JPM”.
Rerunning the strategy on the sub-portfolio of these five stocks gives a somewhat better performance.
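The screening step can be illustrated with a small Python sketch (the post itself uses R). It ranks names by the lag-one autocorrelation of their daily returns and keeps the most negative ones. The tickers and return series below are made up for illustration and are not taken from the DJI data.

```python
def lag1_autocorrelation(returns):
    # Sample autocorrelation at lag one: covariance of neighbours over variance.
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns)
    if var == 0:
        return 0.0
    cov = sum((returns[i] - mean) * (returns[i + 1] - mean) for i in range(n - 1))
    return cov / var

def screen(returns_by_ticker, top):
    # Most negative autocorrelation first, i.e. the strongest one-day reversal.
    ranked = sorted(returns_by_ticker,
                    key=lambda t: lag1_autocorrelation(returns_by_ticker[t]))
    return ranked[:top]

# Hypothetical tickers with toy daily returns.
toy = {
    "ALT": [0.02, -0.02, 0.02, -0.02, 0.02, -0.02],    # flips sign every day
    "TREND": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],     # steadily drifting
    "FLAT": [0.01, 0.01, -0.01, -0.01, 0.01, 0.01],    # slow alternation
}
picked = screen(toy, top=1)
print(picked)
```

A strongly negative value means that an up day tends to be followed by a down day, which is the behavior the mean reversion strategy tries to exploit.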

Again, the following code is used to load DJI data from Yahoo and run the strategy

Next, a universe of about 5000 NYSE and NASDAQ stocks is screened for those with the best individual autocorrelation within the 2011 time frame. The strategy is then run for 2011 and out of sample for Jan-Nov 2012, with either the 5 or the 100 best names. The five-name portfolio consists of “UBFO”, “TVE”, “GLDC”, “ARI” and “CTBI”.

Discussion

We test drove the passive aggressive trading strategy on recent daily price data. While it did not do very well on a mainstream index like the DJI, pre-screening the stock universe according to autocorrelation did improve results, even when running the strategy out of sample for the purpose of cross validation. Other extensions, like including a cash asset or allowing for short positions, require further investigation.

The strategy has interesting elements, such as minimizing the change in the portfolio weights and the use of a loss function. It directly takes only the immediate one-period lag into account. For use in other time domains, like high frequency, one could consider having a collection of competing trading agents with a range of time lags that filter out the dominant mean reverting mode.

There are a variety of approaches in the literature that I would like to demonstrate as well in the framework of this blog, so stay tuned.

Let’s look at the task of selecting a portfolio of stocks that optimizes a particular measure of performance. In the structured product setting one might want to compose a portfolio to be used as a reference index for a derivative, with the objective that the index has a specifically high or low correlation structure, since the overall correlation affects the volatility of the final product. Similarly, an individual investor might like to select a number of stocks in order to compose a portfolio that historically had a low volatility and a high growth rate. Yet another objective could be to find a portfolio with a specific correlation behavior with respect to a particular target.

To show a particular example, let’s try to select a portfolio of four assets out of a larger universe of around 5,000 NYSE and NASDAQ names that would have had the highest possible Sharpe ratio in the year 2011. The Sharpe ratio measures the ratio of the average daily return over its standard deviation and is a widely used performance measure.

The attached R code runs the problem against a small universe of DJI stocks, in order not to require uploading a large amount of market data. For this exercise I don’t correct the stock growth for the risk-free rate and consider long-only stock combinations.

For this task one must first be able to calculate the relative weights for a given set of four stocks that optimize the Sharpe ratio. In the example code the optimizer solnp from the R package Rsolnp is used, instead of a specialized function from a financial package, in order to retain the flexibility of optimizing other non-standard measures.

There are a number of approaches that could be used for stock selection. The brute force method would be to iterate over all possible combinations, which quickly becomes a very large problem to solve with limited computational resources. For example to check all four name portfolios out of a universe of N stocks would require N*(N-1)*(N-2)*(N-3)/24 combinations.

At the other end one can use a bootstrapping method, by first selecting the stock with the single best Sharpe ratio. Finding this stock requires N computations. Next, checking all two-stock portfolios that are composed of the best stock and each remaining stock from the universe adds N-1 calculations. Repeating this sequence of adding just a single name at a time leads first to a three-name and finally to a four-name portfolio after a total of 4*N-6 steps. Having arrived at an already half-way optimized four-name portfolio, one can then try to improve it further by rotating out individual names against the universe of stocks.
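To get a feeling for the two operation counts quoted above, the following Python sketch evaluates them for the 30 DJI names and for a 5000-name universe. The function names are illustrative only.

```python
from math import comb

def brute_force_count(n, k=4):
    # Number of distinct k-name portfolios out of n stocks:
    # n*(n-1)*(n-2)*(n-3)/24 for k = 4.
    return comb(n, k)

def bootstrap_count(n, k=4):
    # One name is added at a time: n + (n-1) + ... + (n-k+1) evaluations.
    return sum(n - j for j in range(k))

for universe_size in (30, 5000):
    print(universe_size,
          brute_force_count(universe_size),
          bootstrap_count(universe_size))
```

For the 30 DJI stocks this gives 27405 brute force combinations against 114 bootstrap evaluations, and the gap widens rapidly with the size of the universe.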

A systematic variation of the trial-and-error bootstrapping leads to the tree-based search method illustrated in the following.

The picture below for the universe of 30 DJI stocks starts off at the bottom node with the portfolio of the stocks “JPM”, “HPQ”, “AA” and “BAC”, which has a low Sharpe ratio of -0.42. From this the tree search generates four sub-portfolio combinations, (“HPQ”, “AA”, “BAC”), (“JPM”, “AA”, “BAC”), (“JPM”, “HPQ”, “BAC”) and (“JPM”, “HPQ”, “AA”), by removing one member each. Then it searches linearly among the universe of stocks for the best name to add to each sub-portfolio in order to arrive at a new set of four-name portfolios. The one with the best Sharpe ratio is then selected. In the graph below this was the combination (“AA”, “BAC”, “HPQ”, “MCD”). It becomes the root node of the new search tree branch. Eventually the tree search ends when no new combination with a higher Sharpe ratio can be found along the deepest search path.
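The search step described above can be sketched as follows. This is a simplified Python illustration, not the R code of this post: it assumes equal weights instead of weights optimized with solnp, and it runs on randomly generated returns for hypothetical names S0 to S7.

```python
import random
from statistics import mean, stdev

def portfolio_sharpe(names, data):
    # Assumption: equal weights stand in for the optimized weights of the post.
    days = len(next(iter(data.values())))
    daily = [mean(data[n][t] for n in names) for t in range(days)]
    return mean(daily) / stdev(daily)   # no risk-free rate correction

def tree_step(portfolio, data):
    """Drop each member in turn, refill from the universe, keep the best."""
    best, best_score = tuple(portfolio), portfolio_sharpe(portfolio, data)
    for leave_out in portfolio:
        sub = [n for n in portfolio if n != leave_out]
        for candidate in data:
            if candidate in portfolio:
                continue
            trial = tuple(sorted(sub + [candidate]))
            score = portfolio_sharpe(trial, data)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score

def tree_search(start, data):
    # Follow the best branch until no replacement improves the ratio.
    current, score = tuple(sorted(start)), portfolio_sharpe(start, data)
    while True:
        candidate, candidate_score = tree_step(current, data)
        if candidate_score <= score:
            return current, score
        current, score = candidate, candidate_score

# Synthetic universe of eight hypothetical names with 60 daily returns each.
rng = random.Random(1)
universe = {f"S{i}": [rng.gauss(0.0005 * i, 0.01) for _ in range(60)]
            for i in range(8)}
start = ("S0", "S1", "S2", "S3")
best, best_score = tree_search(start, universe)
print(best, round(best_score, 3))
```

Each step evaluates four sub-portfolios against the remaining names, so the cost per tree node grows only linearly with the size of the universe.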

The following shows a similar search tree, now using the full universe of 5000 NYSE and NASDAQ stocks for which we have sufficient historical data without missing dates.

The graphs below depict the historical cumulative performance of the optimized portfolio against the performance of the initial portfolio. Note that the search among the 5000 stock universe yielded a growth portfolio with much smaller variance than just selecting among the 30 DJI stocks.

Further optimization approaches are Monte Carlo like tactics of selecting random starting points for multiple searches. Other extensions are genetic algorithms that create new trial portfolios out of the genetic crossings of two parent portfolios. Overall the tree approach here seemed to be robust against varying the selection of the starting root portfolio and converged after a few search nodes.

This post is mainly to demonstrate the optimization technique rather than to advertise a particular portfolio selection. In reality a portfolio that performed particularly well in one year is not guaranteed to have similar characteristics in the following year. A more complete study would include back-testing and cross-validation of the selected portfolios against data that was not used during optimization.

To run the tree search first create the following functions in R by copying and pasting

In this post I would like to illustrate the use of the R package “ape” for phylogenetic trees for the purpose of assembling trees. The function read.tree creates a tree from a text description. For example, the following code creates and displays two elementary trees:

Note that for better illustration in the example above the lengths are chosen differently for the branches that lead to each leaf or tip.

Next the function bind.tree is used to attach the child tree to the root tree. The argument where determines the position within the parent root tree where the child branch should be attached. In the first example below the child branch is attached to the branch of the ‘TIP1’ leaf. In this process the receiving leaf label is replaced by the new branch. In order to retain the receiving leaf label, I set the node.label attribute of the incoming child tree to its name.

A short post about online educational resources on machine learning. Perhaps it is a sign of increasing popularity of the field that there are now several courses on machine learning accessible online and for free.

Out of Stanford University comes Andrew Ng’s almost ‘legendary’ machine learning lecture presented on the Coursera platform. The class consists of video lectures, multiple choice style review quizzes and programming projects in Matlab. A discussion forum supplies the student interaction that otherwise only a traditional university program offers. The class covers a good variety of supervised and unsupervised learning concepts, like neural nets, support vector machines, k-means and recommender systems.

Another online course is Berkeley’s CS188.1x Artificial Intelligence on the EdX platform by Harvard and MIT. This course offers video lectures on topics like decision theory and reinforcement learning, with programming assignments in Python.

A third offering in this direction is Sebastian Thrun’s course on Artificial Intelligence presented at Udacity. Here the lectures are themed around applications to robotics, using the example of Google’s driverless car. Udacity takes an interesting educational approach. It presents lectures in short sequences that are frequently interrupted by quick multiple choice style review questions. The purpose is to keep the student’s attention focused on the lecture and provide the feel of real teacher interaction. Similar to the EdX course, programming projects are done in Python. Here the student can paste his code directly into the Udacity browser window for instant feedback, without actually having to run Python on the local machine.

Another platform to watch is Khan Academy, which originally started with elementary to high school education but appears to be destined to expand into the university and professional level.

With the cost of learning at traditional universities exploding, at least in some parts of the world, it is fascinating to observe the formation of free or low cost online educational platforms. Possible revenue models for these platforms could be that lectures are going to be offered for free but a payment is requested for tests and certifications. There are already professional designations like the chartered financial analyst (CFA) or the FRM for risk management that are based on certifications outside the traditional university education setting. Interestingly a lot of development in the e-learning area comes out of the traditional schools, as a successful online program is certainly a great advertisement and recruitment tool by itself.

Machine Learning and Kernels

A common application of machine learning (ML) is the learning and classification of a set of raw data features by a ML algorithm or technique. In this context a ML kernel acts to the ML algorithm like sunshades, a telescope or a magnifying glass to the observing eye of a student learner. A good kernel filters the raw data and presents its features to the machine in a way that makes the learning task as simple as possible.

Historically a lot of progress in machine learning has been made in the development of sophisticated learning algorithms; however, selecting appropriate kernels remains a largely manual and time consuming task.

This post is inspired by a presentation by Prof. Mehryar Mohri about learning kernels. It reflects on the importance of kernels in support vector machines (SVM).

A total of three examples are presented. A linear kernel is shown to solve the first example but fails for the second task. There a square kernel is successful. Then a third example is presented for which neither the linear nor the square kernel is sufficient. There a successful kernel can be generated out of a mixture of both base kernels. This illustrates that kernels can be generated out of bases, resulting in products that are more powerful in solving the task at hand than each individual component.

Support Vector Machines

Consider a support vector machine (SVM) for a classification task. Given a set of pairs of feature data-point vectors x and classifier labels y in {-1,1}, the task of the SVM algorithm is to learn to group features x by classifiers. After training on a known data set, the SVM is intended to correctly predict the class y of a previously unseen feature vector x.

Applications of support vector machines in quantitative finance include, for example, predictive tasks, where x consists of features derived from a historical stock indicator time series and y is a sell or buy signal. Another example could be that x consists of counts of key-words within a text such as a news announcement and y categorizes it according to its impact on market movements. Outside of finance a text based SVM could be used to filter e-mail to be forwarded to either the inbox or the spam folder.

Linear Kernel

As indicated above the SVM works by grouping feature points according to their classifiers.
For illustration in the toy example below two dimensional feature vectors x={x1,x2} are generated in such a way that the class y=-1 points (triangles) are nicely separated from the class y=1 (circles).

The SVM algorithm finds the largest possible linear margin that separates these two regions. The marginal separators rest on the outpost points that are right on the front line of their respective regions. These points, marked as two bold triangles and one bold circle in the picture below, are named the ‘support vectors’ as they are supporting the separation boundary lines. In fact the SVM learning task fully consists of determining these support vector points and the margin distance that separates the regions. After training all other non-support points are not used for prediction.

In linear feature space the support vectors $x_i$ add up to an overall hypothesis vector h, $h = \sum_i c_i x_i$, such that the classification frontiers are given by the lines $h \cdot x + b = 1$ and $h \cdot x + b = -1$, centered around $h \cdot x + b = 0$.
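How a hypothesis vector built from support vectors classifies a new point can be shown with a hand-made toy case. This is a Python sketch and not output of ksvm: the two support vectors, their labels and the coefficients are assumed values chosen so that both points lie exactly on their frontier lines.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical toy solution: one support vector per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]    # assumed non-negative dual coefficients
offset = 0.0

# Hypothesis vector h = sum_i alpha_i * y_i * x_i
h = [sum(a * y * sv[k] for a, y, sv in zip(alphas, labels, support_vectors))
     for k in range(2)]

def decision(x):
    # +1 and -1 mark the two frontier lines, 0 the separating center line.
    return dot(h, x) + offset

def classify(x):
    return 1 if decision(x) >= 0 else -1

print(h, decision((1.0, 1.0)), decision((-1.0, -1.0)))
print(classify((3.0, 0.5)), classify((-2.0, 0.0)))
```

Only the support vectors enter the hypothesis vector, which is why all other training points can be dropped after training.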

The code below utilizes the ksvm implementation in the R package ‘kernlab’, making use of Jean-Philippe Vert’s tutorials for graphing the classification separation lines.

Quadratic Kernel

The following example illustrates a case where the feature points are not linearly separable. Points of the class y=1 (circles below) are placed in an inner region surrounded on all sides by points of class y=-1, again depicted as triangles. In this example there is no single straight (linear) line that can separate both regions. However, here it is still possible to find such a separator by transforming the points from feature space to a quadratic kernel space with points given by the corresponding squared coordinates $(x_1^2, x_2^2)$.

The technique of transforming from feature space into a measure that allows for a linear separation can be formalized in terms of kernels. Assume $\phi(x)$ is a vector coordinate transformation function. For example, a squared coordinate space would be $\phi(x) = (x_1^2, x_2^2)$. The SVM separation task is now acting on $\phi(x)$ in the transformed space to find the support vectors that generate

$h \cdot \phi(x) + b = \pm 1$

for the hypothesis vector $h = \sum_i c_i \phi(x_i)$ given by the sum over support vector points $x_i$. Putting both expressions together we get

$\sum_i c_i K(x_i, x) + b = \pm 1$

with the scalar kernel function $K(x_i, x) = \phi(x_i) \cdot \phi(x)$. The kernel is composed out of the scalar product between a support vector and another feature vector point in the transformed space.
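The identity between the kernel and the scalar product in the transformed space can be checked numerically. The following Python sketch uses the squared coordinate map from the text; the test points are arbitrary illustrative values.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def phi(x):
    # Squared coordinate transformation (x1, x2) -> (x1^2, x2^2).
    return [xi * xi for xi in x]

def quadratic_kernel(u, v):
    # Evaluated directly on raw features, without calling phi.
    return sum((ui * vi) ** 2 for ui, vi in zip(u, v))

u, v = (1.5, -2.0), (0.5, 3.0)
lhs = quadratic_kernel(u, v)
rhs = dot(phi(u), phi(v))
print(lhs, rhs)

# In the squared space an inner point and an outer point are split by the
# straight line z1 + z2 = 1, which is a circle of radius one in feature space.
inner, outer = (0.5, 0.5), (2.0, 0.0)
print(sum(phi(inner)), sum(phi(outer)))
```

The kernel evaluation never builds the transformed vectors, which is what makes the approach practical for transformations into high dimensional spaces.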

In practice the SVM algorithm can be fully expressed in terms of kernels without having to actually specify the feature space transformation. Popular kernels are for example higher powers of the linear scalar product (polynomial kernel). Another example is a probability weighted distance between two points (Gaussian kernel).

Implementing a two dimensional quadratic kernel function allows the SVM algorithm to find support vectors and correctly separate the regions. The graph below illustrates that the non-linear regions are linearly separated after transforming to the squared kernel space.

The “R” implementation makes use of ksvm’s flexibility to allow for custom kernel functions. The function ‘kfunction’ returns a linear scalar product kernel for parameters (1,0) and a quadratic kernel function for parameters (0,1).

Alignment and Kernel Mixture

The final exploratory feature data set consists again of two classes of points within two dimensional space. This time two distinct regions of points are separated by a parabolic boundary, where vector points of class y=1 (circles) are below and y=-1 (triangles) are above the separating curve. The example is selected for its property that neither the linear nor the quadratic kernels alone are able to resolve the SVM classification problem.

The second graph below illustrates that feature points of both classes are scattered onto overlapping regions in the quadratic kernel space. It indicates that for this case the sole utilization of the quadratic kernel is not enough to resolve the classification problem.

In “Algorithms for Learning Kernels Based on Centered Alignment” Corinna Cortes, Mehryar Mohri and Afshin Rostamizadeh compose mixed kernels out of base kernel functions. This is perhaps similar to how a vector can be composed out of its coordinate base vectors or a function can be assembled in functional Hilbert space.

In our example we form a new mixed kernel out of the linear and the quadratic kernels $K_1$ and $K_2$

$K_M = \mu_1 K_1 + \mu_2 K_2$

The graph below demonstrates that the mixed kernel successfully solves the classification problem even though each individual base kernel is insufficient on its own. In experimentation the actual values of the mixture weights $\mu_1$, $\mu_2$ are not critical. Following the study above, we choose the weights according to how well each base kernel individually aligns with the raw classification matrix $y y^T$ composed out of the classifier vector $y$.

Alignment is based on the observation that a perfectly selected kernel matrix would trivially solve the classification problem if its elements are equal to the $y y^T$ matrix. In that case, choosing an arbitrary vector $x_s$ as support, the kernel would give $K(x_s, x) = 1$ if $x_s$ and $x$ are in the same category and $-1$ otherwise.

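Similar to the concept of correlation, kernel alignment between two kernels is defined by the expectation over the feature space, resulting in matrix products

$\rho(K_1, K_2) = \langle K_{1c}, K_{2c} \rangle_F / \sqrt{\langle K_{1c}, K_{1c} \rangle_F \langle K_{2c}, K_{2c} \rangle_F}$

with the product $\langle A, B \rangle_F = \mathrm{Tr}(A^T B)$. The expectation is taken assuming centered kernel matrices according to $K_c = (I - \mathbf{1}\mathbf{1}^T / m) K (I - \mathbf{1}\mathbf{1}^T / m)$ for $m$ data points. (For details see reference above.)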

Kernel matrix centering and alignment functions are implemented in R as follows
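For readers without R at hand, the centering and alignment steps can also be sketched in plain Python. The sample points, the parabolic labeling rule and all function names below are assumptions made for illustration, and the alignment values are used directly as mixture weights without any further normalization.

```python
def kernel_matrix(points, kernel):
    return [[kernel(p, q) for q in points] for p in points]

def center(K):
    # K_c = (I - 11'/m) K (I - 11'/m): subtract row and column means,
    # then add back the grand mean.
    m = len(K)
    row = [sum(r) / m for r in K]
    col = [sum(K[i][j] for i in range(m)) / m for j in range(m)]
    total = sum(row) / m
    return [[K[i][j] - row[i] - col[j] + total for j in range(m)]
            for i in range(m)]

def frobenius(A, B):
    # Frobenius product Tr(A'B) as the sum of element-wise products.
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def alignment(K1, K2):
    A, B = center(K1), center(K2)
    return frobenius(A, B) / (frobenius(A, A) * frobenius(B, B)) ** 0.5

linear = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
quadratic = lambda u, v: sum((ui * vi) ** 2 for ui, vi in zip(u, v))

# Toy points labeled +1 below and -1 above the parabola x2 = x1^2.
points = [(-2, 1), (-1, 0), (0, -1), (1, 0), (2, 1),
          (-1, 3), (0, 1), (1, 3), (0, 2), (2, 5)]
y = [1 if p[1] < p[0] ** 2 else -1 for p in points]
Y = [[yi * yj for yj in y] for yi in y]

K_lin = kernel_matrix(points, linear)
K_quad = kernel_matrix(points, quadratic)
mu_lin, mu_quad = alignment(K_lin, Y), alignment(K_quad, Y)
K_mix = [[mu_lin * a + mu_quad * b for a, b in zip(ra, rb)]
         for ra, rb in zip(K_lin, K_quad)]
print(mu_lin, mu_quad)
```

An alignment of one means that the centered kernel matrix is proportional to the centered label matrix, so the value acts like a correlation coefficient between a kernel and the classification target.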

In conclusion, this post attempts to illuminate the important role of kernel selection in machine learning and demonstrates the use of kernel mixture techniques. Kernel weights were chosen somewhat ad hoc by alignment to the target classifiers, thus preceding the actual SVM learning phase. More sophisticated algorithms and techniques that automate and combine these steps are a topic of ongoing research.

Ideas from this area are illustrated with an elementary example of an asset that follows a binomial process. At first the bid-ask spread is evaluated to be zero under a risk-neutral market assumption. Then the example illustrates how a conic market distortion can create a non-zero spread.

Complete Risk Neutral Market

A fundamental concept in financial theory is the assumption of a complete market and the existence of a risk neutral measure. A risk neutral market is indifferent to either buying or selling at any size since all cashflows can be hedged perfectly and no residual risk is left.

Let’s for example consider being long an asset X in a market with a risk neutral probability for a one period move of (1/3) to $120 and (2/3) to $90.

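The risk neutral expectation of holding this position is

$E[X] = \frac{1}{3} \cdot 120 + \frac{2}{3} \cdot 90 = 100$.

The corresponding short position in X presents itself as a one period move of (1/3) to $-120$ and (2/3) to $-90$

with expected outcome of

$E[-X] = -\frac{1}{3} \cdot 120 - \frac{2}{3} \cdot 90 = -100$.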

A risk neutral market does not have any preference between being long or short. It will pay (bid) $100 for owning the asset and charge (ask) $100 for assuming the liability.

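The transition probabilities imply cumulative probabilities given by the following step functions

$F_X(x) = 0$ for $x < 90$, $F_X(x) = 2/3$ for $90 \le x < 120$, $F_X(x) = 1$ for $x \ge 120$,

$F_{-X}(x) = 0$ for $x < -120$, $F_{-X}(x) = 1/3$ for $-120 \le x < -90$, $F_{-X}(x) = 1$ for $x \ge -90$,

as depicted in this graph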

We also have

Bidding and Asking

Generally real markets have a bid-ask spread. The ask price a is the price one must pay to buy an asset or a product from the market. The bid price b is the price one receives when selling an asset to the market. Generally we observe a positive bid-ask spread

$a - b > 0$.

In a market that exhibits a non-zero bid-ask spread, long assets and liabilities (short assets) are valued differently. Unwinding a position in a long asset entails selling the asset to the market for the bid price, thus the bid price reflects the value of a long position. Unwinding liabilities, or a position in a short asset, requires buying the asset from the market at the asking price, which reflects the liability of a short position.

The MINVAR Distortion Function

As seen above the risk neutral market distribution implies a zero bid-ask spread. In order to introduce a difference in valuation between long and short positions, the distribution is distorted away from the risk neutral mean. The result of the distortion is that greater weights are given to losses for the long position and to larger liabilities for the short position. This leads to a lower expected value of the long position, thus a lower bid price, and a greater expected liability for the short position, thus a higher ask price.

At this point we regard the selection of the distortion function as a model input. In this framework the distortion function maps from risk neutral probabilities in the range from 0 to 1 to ‘real’ market probabilities also in the range from 0 to 1, necessarily mapping the points 0 and 1 onto themselves.

One interesting choice in the possible class of distortion functions is the ‘minvar’ function

$\psi_n(y) = 1 - (1 - y)^{n+1}$.

For n=0 the function is the trivial identity $\psi_0(y) = y$; for larger values of n it is a monotonically increasing concave function, with each larger order n dominating the previous orders.

As Cherny and Madan point out, the minvar function has a nice interpretation. It reflects the probability distribution of the minimum over n+1 successive drawings. Concretely, for a given probability function y(x)=P(X<x), the minvar distortion reflects the probability that the minimum of n+1 independent samples of X is below x. By construction, the term $(1-y)^{n+1}$ is the probability that all n+1 samples are greater than x. That means the minimum of the samples is greater than x. The complementary probability that the minimum is smaller than x is then given by the difference to 1.
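The interpretation as the distribution of a minimum can be verified by brute force on the binomial asset from above. The Python sketch below enumerates all sequences of n+1 independent draws with exact fractions; it uses the convention P(X<=x) for the cumulative probability, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def minvar(y, n):
    # Distortion 1 - (1 - y)^(n+1) applied to a cumulative probability y.
    return 1 - (1 - y) ** (n + 1)

# Binomial asset from the risk neutral section: 90 w.p. 2/3, 120 w.p. 1/3.
outcomes = {90: Fraction(2, 3), 120: Fraction(1, 3)}

def cdf(x):
    return sum(p for value, p in outcomes.items() if value <= x)

def cdf_of_minimum(x, n):
    # Exact probability that the smallest of n+1 independent draws is <= x.
    total = Fraction(0)
    for draw in product(outcomes, repeat=n + 1):
        prob = Fraction(1)
        for value in draw:
            prob *= outcomes[value]
        if min(draw) <= x:
            total += prob
    return total

for n in range(3):
    print(n, cdf_of_minimum(90, n), minvar(cdf(90), n))
```

For n=1 both columns give 8/9 at the lower state, which is the distorted probability that enters the bid price further below.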

As a caveat the minvar function is only one possible choice for a distortion function and the discussion above linking it to the distribution of the minimum is not needed in the present context.

Distorted Cumulative Probabilities

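Taking the MINVAR function for n=1 we have

$\psi_1(y) = 1 - (1 - y)^2 = 2y - y^2$.

In the following we use this function to directly distort the risk neutral probabilities from the section above. Alternatively one could first remove the mean price of $100 and apply the distortion to the net P&L process. However, for the sake of better comparison between the risk neutral and distorted probabilities, and to clearly distinguish the long and short positions, the mean is left in. With this we get, writing $F_X$ and $F_{-X}$ for the risk neutral cumulative probabilities of the long and the short position,

$\psi_1(F_X(x)) = 0$ for $x < 90$, $8/9$ for $90 \le x < 120$, $1$ for $x \ge 120$,

$\psi_1(F_{-X}(x)) = 0$ for $x < -120$, $5/9$ for $-120 \le x < -90$, $1$ for $x \ge -90$.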

Bid Price: Distorted Asset

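The price process for the long asset becomes a move to 120 with distorted probability 1/9 and to 90 with distorted probability 8/9,

with the expectation

$\frac{1}{9} \cdot 120 + \frac{8}{9} \cdot 90 \approx 93.33$.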

Bidding prices up to the calculated amount contain all possible trades with expected positive net cash flows, or positive alpha trades. On the other hand competitive market forces maximize the bid price up to this acceptable limit.

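The formal calculation of the bid price in the collection of reference papers under the aforementioned link is given as the expectation integral over the distorted asset distribution

$b = \int x \, d\psi(F_X(x))$,

where $\psi$ is the distortion function and $F_X$ the risk neutral cumulative distribution of the asset.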

Ask Price: Distorted Liability

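The distorted short position evolves according to a move to -120 with distorted probability 5/9 and to -90 with distorted probability 4/9,

with expectation

$-\frac{5}{9} \cdot 120 - \frac{4}{9} \cdot 90 \approx -106.67$.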

Here the market is asking to be compensated for taking on the trade. Again, any asking price that is larger in absolute terms contains the possible positive alpha positions the market is willing to accept, while competitive forces limit the asking price to the calculated amount.

Consequently the asking price is given by the (negative) expectation integral over the distorted liability distribution

$a = -\int x \, d\psi(F_{-X}(x))$

or, in terms of the long asset distribution $F_X$ and after a change of signs, resulting in an ask price of

$a = -\int x \, d\psi(1 - F_X(x))$

They identify the capital reserve requirement for holding a position in X with the size of the bid-ask spread, since it is the charge to be paid for unwinding in the presence of otherwise perfect market hedges. Similarly, the difference between the mid and the risk-neutral price is the average profit that is distributed among market participants by trading. Note that the simple binomial state example in this post is not rich enough to feature non-equal mid and risk neutral expected prices.

In summary we find the distorted expectation of the long asset to be lower than the (absolute) price of the short asset. In accordance with the discussion above we identify the bid and ask prices with the asset and the liability valued on the distorted market. In our simple example the bid and ask prices are centered around the risk-neutral mid price.

Ask: $106.67 (liability)
Bid: $93.33 (asset)
Mid: $100
Spread: $13.33
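The figures above can be reproduced with a short calculation. The following Python sketch applies the minvar distortion for n=1 to the cumulative probabilities of the long and the short position and sums the outcomes against the distorted probability increments. It uses exact fractions, and the function names are illustrative only.

```python
from fractions import Fraction

def minvar(y, n=1):
    return 1 - (1 - y) ** (n + 1)

def distorted_expectation(outcomes, n=1):
    """Expectation under the minvar-distorted cumulative distribution."""
    total = Fraction(0)
    cum_prev = Fraction(0)
    dist_prev = Fraction(0)
    for value in sorted(outcomes):            # walk up the step function
        cum = cum_prev + outcomes[value]      # risk neutral cumulative prob.
        dist = minvar(cum, n)                 # distorted cumulative prob.
        total += value * (dist - dist_prev)   # weight by the distorted jump
        cum_prev, dist_prev = cum, dist
    return total

asset = {120: Fraction(1, 3), 90: Fraction(2, 3)}
liability = {-value: p for value, p in asset.items()}

bid = distorted_expectation(asset)
ask = -distorted_expectation(liability)
mid = (bid + ask) / 2
spread = ask - bid
print(float(bid), float(ask), float(mid), float(spread))
```

Setting n=0 switches the distortion off, in which case bid and ask both collapse to the risk neutral price of 100.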

This concludes this note on the concept of conic finance illustrated on an elementary example in the context of bid-ask spreads. The methodology has a rich area of application to markets such as expected profits, capital requirements, leveraging, hedging etc, which can be found under the reference link to An Introduction to Conic Finance and its Applications.