Relative valuation is a popular technique that relies on the market value of similar assets to price a given asset. For example, to price a laundry business, you might calculate the price-to-earnings ratios of similar businesses and multiply the average ratio by the laundry business's earnings to obtain a valuation (a worked sketch follows the steps below). Titman and Martin describe this methodology as a four-step process:

Step 1: Identify similar or comparable investments and recent market prices for each.

Step 2: Calculate a valuation metric for use in valuing the asset.

Step 3: Calculate an initial estimate of value.

Step 4: Refine or tailor your initial valuation estimate to the specific characteristics of the investment.
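To make the laundry example concrete, here is a minimal sketch of steps 1 through 3. All names and figures are purely illustrative:

# Step 1: comparable laundry businesses with observed market prices and earnings
comparables = {"Laundry A": (1200000, 150000),   # (market value, earnings)
               "Laundry B": (900000, 120000),
               "Laundry C": (2000000, 235000)}

# Step 2: compute each comparable's P/E and average them into a valuation metric
pe_ratios = [price / float(earnings) for price, earnings in comparables.values()]
avg_pe = sum(pe_ratios) / len(pe_ratios)

# Step 3: apply the average P/E to the target's earnings for an initial estimate
target_earnings = 100000
initial_estimate = avg_pe * target_earnings
print("Average P/E: %.2f, initial value estimate: $%.0f" % (avg_pe, initial_estimate))

Step 4 would then adjust this initial estimate for anything that makes the target different from its comparables, such as growth prospects or location.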

Some of the most commonly used multiples are:

Price to earnings (P/E)

Market to book value of equity

Enterprise Value (EV) to EBITDA

Enterprise Value (EV) to revenue

Enterprise Value (EV) to cash flow

This list is not exhaustive, and you can create your own multiple. This is particularly popular in the technology sector, where analysts have come up with multiples such as Enterprise Value to unique visitors or website hits. In doing so, you must ensure that the components of the multiple are consistent with each other. For example, you might consider using the price-to-sales ratio as a valuation multiple. However, an implicit assumption behind this multiple is that comparable companies have identical capital structures, which is very rarely the case in practice. When this assumption is violated, the multiple becomes inconsistent because it is impacted by the amount of debt that a company has relative to its equity.
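To see the inconsistency concretely, consider two hypothetical firms with identical sales and enterprise value but different capital structures (all numbers are made up):

# Two hypothetical firms: same sales and enterprise value, different leverage
sales = 100.0
enterprise_value = 200.0     # EV = equity value + net debt

debt_light_equity = 180.0    # funded mostly with equity (net debt = 20)
debt_heavy_equity = 80.0     # funded mostly with debt (net debt = 120)

# Price-to-sales mixes an equity numerator with a firm-level denominator,
# so it moves with leverage even though operations are identical
print("P/S (light debt): %.2f" % (debt_light_equity / sales))   # 1.80
print("P/S (heavy debt): %.2f" % (debt_heavy_equity / sales))   # 0.80

# EV-to-sales keeps numerator and denominator at the firm level,
# so it is the same for both firms
print("EV/Sales: %.2f" % (enterprise_value / sales))            # 2.00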

Finally, a key step in applying this methodology is determining which multiple is appropriate for the asset you are trying to value. For example, valuing young firms and startups using the P/E ratio is likely to be inappropriate if those firms have negative or highly volatile earnings; the EV to sales ratio would likely give a better estimate, as the short example below illustrates. Additionally, it is important to realize that these multiples have different characteristics: EV to EBITDA ignores the firm's CapEx, depreciation, and capital structure, while the P/E ratio takes those into account. Using several multiples concurrently allows you to see the big picture and understand what is driving the valuation of an asset.
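A quick, made-up illustration of why P/E breaks down for a money-losing young firm while EV/Sales remains usable:

# Hypothetical young firm with negative earnings
market_cap = 500.0
net_debt = 50.0
earnings = -20.0
revenue = 240.0

pe = market_cap / earnings                     # -25.0: a negative P/E carries no economic meaning
ev_sales = (market_cap + net_debt) / revenue   # ~2.29: still interpretable
print("P/E: %.1f, EV/Sales: %.2f" % (pe, ev_sales))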

In this notebook, we systematize the methodology introduced above for companies in the S&P 500 using two different Machine Learning approaches. First, we replicate Gael Varoquaux's scikit-learn analysis, which extracts a graphical structure from the correlation of intraday variations and then applies Affinity Propagation to group together stocks that behave similarly. Second, we analyze each company's latest 10-K using the Doc2Vec implementation in gensim to quantify similarity. In the first case, we use the average multiple in each cluster to value each company. In the second case, we pick the top three comparable companies and transform the similarity measures into weights to compute a weighted-average comparable multiple. We then apply this multiple to each company's financials to obtain a valuation.
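For the second case, a minimal sketch of how similarity scores can be turned into weights. The variable names and numbers here are hypothetical, not the notebook's actual variables:

import numpy as np

# Hypothetical top-3 comparables for one company: their EV/EBITDA multiples
# and the Doc2Vec similarity scores to the target company
comp_multiples = np.array([9.5, 11.0, 13.2])
similarities = np.array([0.82, 0.74, 0.65])

# Normalize similarities into weights and take the weighted-average multiple
weights = similarities / similarities.sum()
weighted_multiple = np.dot(weights, comp_multiples)

# Apply the multiple to the target's EBITDA to get an implied enterprise value
target_ebitda = 500.0
implied_ev = weighted_multiple * target_ebitda
print("Weighted EV/EBITDA: %.2f, implied EV: %.0f" % (weighted_multiple, implied_ev))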

The hope is that this approach captures relationships between companies that human analysts would not be able to perceive. Reading through the financial statements of every company in the S&P 500 and forming an opinion about how these companies relate to each other would be an extremely time-consuming process. As a result, most analysts on the Street cover a specific industry group and might not realize that a company they are researching is actually very similar to a company classified in a different industry group. By using Machine Learning, we can alleviate this issue and, hopefully, gain some valuable insights about the structure of the stock market.

We scrape the list of companies in the S&P 500, along with their tickers and CIK codes, from Wikipedia. We then use the CIK codes to scrape each company's latest 10-K from EDGAR. There are 505 entries in the Wikipedia list because some companies trade under multiple symbols (for example, Discovery Communications Inc.). A few companies' financial statements are not available on EDGAR for various reasons -- we ignore those companies. We clean the data by removing "Table of Contents" markers where they exist, page numbers, line breaks, punctuation and numbers from the statements. We scrape company fundamentals and historical prices from Yahoo! Finance. Note that we drop companies with negative multiples to simplify valuation. While Yahoo! Finance might not be a particularly reliable source for company fundamentals, it is the best among free data sources.

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

def table_extractor(soup):
    """Extract the tables from a soup object"""
    for table in soup.find_all("table"):
        table.extract()
    return soup

sp_500_wiki_link = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
soup_wiki = BeautifulSoup(urlopen(sp_500_wiki_link), 'html.parser')
table_wiki = soup_wiki.find("table", {"class": "wikitable sortable"})

# Fail now if the right table hasn't been found
header = table_wiki.findAll('th')
if header[0].string != "Ticker symbol" or header[1].string != "Security":
    raise Exception("Can't parse wikipedia's table!")

# Retrieve the values in the table
records = []
rows = table_wiki.findAll('tr')
for row in rows:
    fields = row.findAll('td')
    if fields:
        # Get info and SEC company link
        symbol = fields[0].string
        wiki_link = "https://en.wikipedia.org" + fields[1].a.get('href')
        CIK = fields[7].string
        sec_company_link = "https://www.sec.gov/cgi-bin/browse-edgar?" + \
            "action=getcompany&CIK=" + CIK + \
            "&type=10-K&dateb=&owner=include&count=40"
        name = fields[1].a.string
        sector = fields[3].string
        # Get link for the page with the latest 10-K related filings
        soup_comp = BeautifulSoup(urlopen(sec_company_link), 'html.parser')
        table_comp = soup_comp.find("table", {"class": "tableFile2"})
        try:
            # Get the latest filing page
            filings_link = "https://www.sec.gov" + table_comp.a.get('href')
            # Get the link for the latest 10-K
            soup_filings = BeautifulSoup(urlopen(filings_link), 'html.parser')
            table_filings = soup_filings.find("table", {"class": "tableFile"})
            _10K_link = "https://www.sec.gov" + table_filings.a.get('href')
            # Extract the text from the latest 10-K
            try:
                soup_latest_10K = BeautifulSoup(urlopen(_10K_link).read(), 'html.parser')
                soup_latest_10K = table_extractor(soup_latest_10K)
                _latest_10K_txt = soup_latest_10K.get_text()
            except:
                # If the latest 10-K is not available, return N/A
                _latest_10K_txt = np.nan
        except:
            # If the filings are not available, return N/A
            _10K_link = np.nan
            _latest_10K_txt = np.nan
        # Append results
        records.append([symbol, wiki_link, name, sector,
                        sec_company_link, CIK, _latest_10K_txt])

headers = ['Symbol', 'Wikipedia Link', 'Name', 'Sector',
           'SEC Filings Link', 'CIK', 'Latest 10K']
data = pd.DataFrame(records, columns=headers)

# Correct ambiguous tickers for later purposes
ambiguous_tickers = ['BRK.B', 'BF.B']
corrected_tickers = ['BRK-B', 'BF-B']
for i, ticker in enumerate(ambiguous_tickers):
    data['Symbol'] = data['Symbol'].replace(ticker, corrected_tickers[i])

Disclaimer: This scraper makes an important assumption regarding the structure of the HTML file. Specifically, it assumes that the data for each item of interest is the string of text located in the 'td' tag following the one where the name of the item was found.
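The cleaning step described above is not shown in the scraper. A minimal sketch using regular expressions, assuming the raw filing text lives in data['Latest 10K'] as built above, could look like this:

import re

def clean_10K(text):
    """Strip Table of Contents markers, line breaks, punctuation,
    page numbers and other digits from a filing's raw text."""
    if pd.isnull(text):
        return text
    text = text.replace("Table of Contents", " ")    # ToC markers
    text = re.sub(r'[\n\r]', ' ', text)              # line breaks
    text = re.sub(r'[^A-Za-z ]', ' ', text)          # punctuation and digits
    return re.sub(r' +', ' ', text).lower().strip()  # collapse whitespace

data['Latest 10K'] = data['Latest 10K'].apply(clean_10K)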

Doc2Vec is an implementation of the Paragraph Vector unsupervised learning algorithm proposed by Quoc Le and Tomas Mikolov for learning fixed-length feature representations from variable-length pieces of text. It aims to overcome some of the weaknesses of the more traditional bag-of-words (BOW) methodology.
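A minimal sketch of how the similarity matrix used below could be built with gensim. The hyperparameters here are illustrative rather than the ones actually used, and it assumes rows with missing filings have already been dropped; note that gensim 4.x renames model.docvecs to model.dv:

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.metrics.pairwise import cosine_similarity

# One TaggedDocument per company, tagged with its row index
documents = [TaggedDocument(words=text.split(), tags=[i])
             for i, text in enumerate(data['Latest 10K'])]

# Train the Paragraph Vector model (hyperparameters are illustrative)
model = Doc2Vec(documents, vector_size=100, window=5, min_count=5,
                workers=4, epochs=20)

# Pairwise cosine similarity between the learned document vectors
doc_vectors = np.array([model.docvecs[i] for i in range(len(documents))])
similarity_matrix_NLP = cosine_similarity(doc_vectors)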

We use three different methodologies for visualizing the structure of the stock market. First, we create a simple scatter plot of the companies' EV/EBITDA against their P/E ratio. Second, we use the correlation matrix of the S&P 500 stocks' intraday variations from the start of 2013 through the last available price. Third, we use the similarity matrix produced by the Doc2Vec model. In the last two cases, t-SNE is used for dimensionality reduction.

There is a clear positive relationship between a company's EV/EBITDA and P/E ratio. While most companies are clustered in the bottom left quadrant, there are a few outliers. We choose to keep those outliers for valuation because they still provide valuable information about companies as we will see later.
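The scatter plot can be produced along the same lines as the Bokeh cells below. This sketch assumes the multiples are stored in columns named 'P/E' and 'EV/EBITDA' (an assumption about the dataframe's layout) and reuses the TOOLS, ColumnDataSource, HoverTool, figure and show objects imported for the other plots:

source = ColumnDataSource(dict(x=data['P/E'], y=data['EV/EBITDA'],
                               color=data['Color'], label=data['Name'],
                               sector=data['Sector']))
hover = HoverTool(tooltips=[("name", "@label"), ("sector", "@sector")])
p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)
p.scatter(x='x', y='y', color='color', legend='sector', source=source)
p.xaxis.axis_label = "P/E"
p.yaxis.axis_label = "EV/EBITDA"
p.title.text = "EV/EBITDA vs. P/E"
show(p)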

In [14]:

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=0, perplexity=10.0)
Y = tsne.fit_transform(similarity_matrix_NLP)

plotting_df = pd.concat([data[['Name', 'Symbol', 'Sector', 'Color']],
                         pd.DataFrame(Y, columns=['x', 'y'])], axis=1)
source = ColumnDataSource(dict(x=plotting_df['x'],
                               y=plotting_df['y'],
                               color=plotting_df['Color'],
                               label=plotting_df['Name'],
                               ticker=plotting_df['Symbol'],
                               sector=plotting_df['Sector']))
hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("name", "@label"),
    ("sector", "@sector"),
    ("ticker", "@ticker"),
    ("(x,y)", "($x, $y)"),
])
p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)
labels = LabelSet(x='x', y='y', text='label', source=source, text_font_size='8pt')
p.scatter(x='x', y='y', color='color', legend='sector', source=source)
p.title.text = "2D Company Similarity Visualization (NLP)"
p.title.text_font_size = "25px"
# Uncomment the following line to add labels
# p.add_layout(labels)
show(p)

While companies are mostly clustered by their industry groups, this analysis seems to capture relationships between companies which go beyond those. The fact that companies such as Activision Blizzard and Electronic Arts are close to each other suggests that the model is doing something right. However, note that it is possible that the model relies on trivial factors, for example, attributing a high similarity score to companies whose financial statements have the same auditor (say, because the writing style is similar).

In [15]:

from sklearn import cluster, covariance

# This code was adapted from Gael Varoquaux's work (see references)

# Calculate intraday variation
variation_df = close_price - open_price
variation_df = variation_df.T.reindex(data['Symbol'])

# Get name, sector and color from the data dataframe
variation_df = \
    data[['Symbol', 'Name', 'Sector', 'Color']].join(variation_df, on='Symbol')

# Drop rows with NAs
variation_df = variation_df.dropna(axis=0)

# Data for the model
var_data = variation_df.drop(['Symbol', 'Name', 'Sector', 'Color'], axis=1).T

# Learn a graphical structure from the correlations
edge_model = covariance.GraphLassoCV()

# Standardize the time series: using correlations rather than covariance is
# more efficient for structure recovery
var_data /= var_data.std(axis=0)
edge_model.fit(var_data)

# Cluster using affinity propagation
_, labels = cluster.affinity_propagation(edge_model.covariance_)
n_labels = labels.max()
variation_df['Cluster'] = np.nan
for i in range(n_labels + 1):
    variation_df.loc[labels == i, 'Cluster'] = i + 1
    print('Cluster %i: %s' % ((i + 1), ', '.join(variation_df['Name'][labels == i])))

# Find a low-dimension embedding for visualization: find the best position of
# the nodes (the stocks) on a 2D plane
embedding = tsne.fit_transform(var_data.T)

# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.06)

# Plot the edges
start_idx, end_idx = np.where(non_zero)
segments = [[embedding.T[:, start], embedding.T[:, stop]]
            for start, stop in zip(start_idx, end_idx)]
values = np.abs(partial_correlations[non_zero])

d1 = variation_df[['Name', 'Symbol', 'Sector', 'Color']].reset_index(drop=True)
d2 = pd.DataFrame(np.stack(embedding, axis=1).T, columns=['x', 'y'])
plotting_df = pd.concat([d1, d2], axis=1)
source = ColumnDataSource(dict(x=plotting_df['x'],
                               y=plotting_df['y'],
                               color=plotting_df['Color'],
                               label=plotting_df['Name'],
                               ticker=plotting_df['Symbol'],
                               sector=plotting_df['Sector']))
hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("name", "@label"),
    ("sector", "@sector"),
    ("ticker", "@ticker"),
    ("(x,y)", "($x, $y)"),
])
p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)
labels = LabelSet(x='x', y='y', text='label', source=source, text_font_size='8pt')
p.scatter(x='x', y='y', color='color', legend='sector', source=source)
p.title.text = "2D Company Similarity Visualization (Correlation)"
p.title.text_font_size = "25px"
# Comment this line to remove segments
p.segment(*np.reshape(np.array(segments).flatten(), (len(segments), 4)).T)
# Uncomment the following line to add labels
# p.add_layout(labels)
show(p)

Here, companies are much more clearly clustered by industry group. For example, one can see that the airline companies (American Airlines Group, Delta Air Lines, etc.) are clustered together and connected to each other.

Disclaimer: While the system appears promising, there are many reasons, luck aside, why these results should not be taken at face value. First, we use the last price of 2016 as a reference, but the financial statements are published later, even though they reflect the state of the company at the end of the year. Additionally, many more tests would need to be performed to evaluate the quality of these signals, which is beyond the aim of this notebook. As such, you should not use these results to make investment decisions.

A positive value means that the company should be trading at a higher value (as of the end of 2016). Overall, there appears to be a positive relationship between the valuations based on those two multiples. One caveat, however, is the limited number of companies. Some companies in the S&P 500 have no good comparables, and therefore the percentage of over/undervaluation suggested by the model should not be taken at face value. For example, companies such as Netflix and Amazon appear highly overvalued based on this model. However, this is because idiosyncratic factors that are not captured by this methodology have a large influence on these companies. In fact, on an EV/Revenue basis, the model actually suggests that Amazon is undervalued. Conducting this analysis with a much larger dataset would partially alleviate this issue. An interesting observation is that Nvidia, which has performed very well recently due to developments in the GPU space, is still not as "overvalued" as Netflix on the basis of those two multiples.
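For reference, the over/undervaluation percentage discussed above can be computed along these lines. The column names here are hypothetical:

# Implied enterprise value from the weighted comparable multiple, compared
# with the observed enterprise value; a positive gap means the model
# implies the company should trade at a higher value
data['Implied EV'] = data['Weighted EV/EBITDA'] * data['EBITDA']
data['Valuation Gap'] = data['Implied EV'] / data['EV'] - 1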

This methodology yields valuations that are much less extreme than those from the NLP approach. One potential explanation is that its clusters are closer to the traditional industry groups, and therefore to how the market actually values these companies.