
Abstract

Governments around the world want to develop their ICT and digital industries. Policymakers thus need a clear sense of the size and characteristics of digital businesses, but this is hard to obtain with conventional datasets and industry codes. This paper uses innovative big data resources to perform an alternative analysis at company level, focusing on ICT-producing firms in the UK (which the UK government refers to as the 'information economy'). Exploiting a combination of public, observed and modelled variables, we develop a novel 'sector-product' approach and use text mining to provide further detail on the activities of key sector-product cells. On our preferred estimates, we find that counts of information economy firms are 42% larger than SIC-based estimates, with at least 70,000 more companies. We also find ICT employment shares over double the conventional estimates, although this result is more speculative. Our findings are robust to various scope, selection and sample construction challenges. We use our experiences to reflect on the broader pros and cons of frontier data use.

JEL classification: C55; C81; L63; L86; O38

Key words: Quantitative methods, firm-level analysis, Big Data, text mining, ICTs, digital economy, industrial policy

This paper was produced as part of the Centre's Productivity and Innovation Programme. The Centre for Economic Performance is financed by the Economic and Social Research Council.

Acknowledgements

This paper is part of a project funded by NESTA, and builds on earlier work funded by Google. Many thanks to Tom Gatten, Prash Majmudar and Alex Mitchell at Growth Intelligence for data, and help with its preparation and interpretation. Thanks to Rosa Sanchis-Guarner for maps.
For advice and helpful comments, thanks to Hasan Bakhshi, Theo Bertram, Siobhan Carey, Steve Dempsey, Juan Mateos-Garcia, Jonathan Portes, Rebecca Riley, Chiara Rosazza-Bondibene, Brian Stockdale, Dominic Webber and Stian Westlake plus participants at workshops organised by Birmingham University, Google, NEMODE, NIESR and TechUK. This work includes analysis based on data from the Business Structure Database, produced by the Office for National Statistics (ONS) and supplied by the Secure Data Service at the UK Data Archive. The data is Crown copyright and reproduced with the permission of the controller of HMSO and Queen's Printer for Scotland. The use of the ONS statistical data in this work does not imply the endorsement of the ONS or the Secure Data Service at the UK Data Archive in relation to the interpretation or analysis of the data. This work uses research datasets that may not exactly reproduce National Statistics aggregates. All the outputs have been granted final clearance by the staff of the SDS-UKDA. The paper gives the views of the authors, not the funders or the data providers. Any errors and omissions are our own.

Max Nathan, SERC, NIESR and IZA. Anna Rosso, NIESR.

Published by Centre for Economic Performance, London School of Economics and Political Science, Houghton Street, London WC2A 2AE.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means without the prior permission in writing of the publisher nor be issued to the public or circulated in any form other than that in which it is published. Requests for permission to reproduce any article or part of the Working Paper should be sent to the editor at the above address.

M. Nathan and A. Rosso, submitted 2014

1. Introduction

Information and Communications Technologies (ICTs) - and the 'digital economy' they support - are of enduring interest to researchers and policymakers. National and local governments are particularly keen to understand the characteristics and growth potential of 'their' digital businesses. Given the recent resurgence of interest in industrial policy across many developed countries (Rodrik 2004; Aiginger 2007; Harrison and Rodríguez-Clare 2009; Aghion, Dewatripont et al. 2012; Aghion, Besley et al. 2013), there is now substantial policy interest in developing stronger, more competitive digital economies. For example, the UK's industrial strategy (Cable 2012) combines horizontal interventions with support for seven key sectors, of which the 'information economy' is one (Department for Business Innovation and Skills 2012; Department for Business Innovation and Skills 2013). The desire to grow high-tech clusters is often prominent in the policy mix - for instance the UK's Tech City initiative, Regional Innovation Clusters in the US and elements of 'smart specialisation' policies in the EU (Nathan and Overman 2013).

In this paper we use novel 'big data' sources to improve our understanding of 'information economy' businesses in the UK - those involved in the production of ICTs. We also use this experience to critically reflect on some of the opportunities and challenges presented by big data tools and analytics for economic research and policymaking.

For policymakers, a solid understanding of these sectors, products and firms is necessary to design effective interventions. However, this is hard to achieve using conventional administrative datasets and industry codes. Data coverage is often imperfect, industry typologies can lack detail, and product categories do not closely align with sector space. More broadly, real-world features of an industry tend to evolve ahead of any given industrial typology.
The UK Government is clear about these challenges: 'Addressing the lack of clear and universally-agreed metrics will be an early priority for Government and industry. There will be a need for continual reassessment of the scope and definition of the information economy as it evolves.' (BIS 2013, p11)

We use an innovative dataset developed by Growth Intelligence (henceforth Gi), which deploys an unusual combination of public administrative data, observed information, and modelled variables derived from unstructured sources using machine learning techniques. We use this off-the-shelf material to develop a novel 'sector-product' mapping of ICT firms. We also take raw text fragments derived by Gi from company websites, and use text mining to shed further light on key sector-product cells. We run these analyses on a benchmarking sample of companies that allows direct comparisons of conventional and big data-driven estimations. The differences are non-trivial: in our preferred estimates we find that the ICT production space is around 42% larger than SIC-based estimates, with at least 70,000 more companies. We also find employment shares over double the conventional estimates, although this result is more speculative.

This approach delivers significant extra dimensionality and detail compared to simply using SIC codes, but it is not without limitations. This brings us to the second contribution of the paper, in which we draw on our experience to highlight opportunities and challenges for researchers working with similar big data methodologies. The use of non-traditional / unstructured sources and scraping / mining / learning tools is growing rapidly in the social sciences (Einav and Levin 2013; King 2013; Varian 2014). Enthusiasts point to huge potential in closing knowledge gaps, and taking research closer to the policy cycle. Sceptics highlight potentially limited access and relevance of these 'frontier' datasets. We use our work to discuss the substantial richness big data can bring to innovation research, and talk through issues of access and relevance, as well as coverage, reliability, quality and working practices that researchers are likely to encounter.

The paper is structured as follows. Section 2 defines key terms and issues.
Section 3 introduces the Growth Intelligence dataset and other data resources, and outlines potential pros and cons of big data approaches. Sections 4 and 5 respectively detail sample construction and identification steps. Sections 6 and 7 give descriptive results. Section 8 concludes.

2. Context and key issues

Our research questions are: first, what is the true extent of ICT manufacturing and service activity in the UK, and what are the key characteristics of these businesses? Second, what are the differences between big data-driven estimates and those from conventional administrative datasets?

2.1 The 'digital economy', the information economy and ICT production

Governments in the UK and elsewhere are keen to grow their 'digital economies'. What does this mean in practice? The digital economy is an economic system based on digital technologies (Negroponte 1996; Tapscott 1997). This is an ecosystem of sorts: an interlocking set of sectors (industries and firms), outputs (both supporting products and services, and the content these are used to generate), and a set of production and distribution inputs used at varying intensities by firms and workers across all sectors (OECD 2011; OECD 2013). We could also define a set of cross-industry occupations where such technological tools are essential to the main tasks (Brynjolfsson and Hitt 2003; Acemoglu and Autor 2011).

Our analysis focuses on the production side of this system, where we map both industries and outputs. We ignore inputs, for the simple reason that it is now hard to think of any economic activity where digital inputs do not feature, and given the pace of change in (say) internet tools and platforms, definition and measurement problems for digital inputs are severe (see Lehr (2012) and OECD (2013) for a discussion of these issues). And as discussed above, while policymakers are keen to improve ICT infrastructure such as broadband networks, they are also increasingly interested in helping sectors and firms to grow.

The standard OECD/UN definitions of digital activities comprise detailed product/service groups identified by an international expert panel: these are then aggregated into less detailed 4-digit standard industry code (SIC) bins (OECD 2011).[1] These SICs form the basis of most analysis. That is, the definition moves from fine-grained to coarser-grained, and is typically one-dimensional. By contrast, we are able to use industry and product information for our alternative mapping and analytics, as we explain in Section 5 below.

[1] We use the most recent agreed definitions available at the time of writing, as developed by the OECD Working Party on Indicators for the Information Society (WPIIS). WPIIS agrees product lists using UN Central Product Classification (CPC) codes, then crosswalks these onto SIC digit cells. See OECD (2011) for detail.

The OECD's three main supply-side activity groups are a) information and communication technologies (ICT), covering computer manufacture, IT and telecoms networks and services and software publishing; b) digital content, covering digital / online activities in music, TV, film, advertising, architecture, design, and e-commerce; and c) wholesale, leasing, installation and repair activities in both ICT and content space.

In this paper we focus on the production of ICT goods and services, rather than content developed using these tools and platforms. Specifically, we are interested in the sectors delineated in the UK Department of Business' 'information economy strategy' (Department for Business Innovation and Skills 2012; Department for Business Innovation and Skills 2013). We refer to firms in these industries as 'information economy businesses'.

There is a live debate in the UK about exactly how broadly to define the information economy in industry terms. Some analysts prefer a very narrow definition, which concentrates purely on ICT manufacturing; conversely, some industry voices would like a much broader approach that includes manufacturing, services and related supply chain activity (such as wholesale, retail, installation and repair). This means that alternative mappings of the information economy need to take into account these differences of opinion. We take ICT services and manufacturing as our base case (see Table 1), and show that our results are robust to narrower and broader starting sets.[2]

[2] We use the whole UN/OECD set of digital economy SIC4 codes as a starting point for our analysis, then crosswalk these to 5-digit level and make some adjustments for the information economy element in a UK context. Following consultation with BIS we exclude the SIC5 cells ('engineering design activities for industrial processes and production') and ('engineering-related scientific and technical consulting activities') specified by the OECD (personal communication, 2 December 2013). Conversely, we exclude the BIS-specified cells ('news agency activities') and ('other information service activities not elsewhere classified') because they are included in the UN/OECD list of content sectors, rather than ICT production. Our robustness checks cover ICT services only (excluding ICT manufacturing, code 26) and a broader set of SICs comprising manufacturing, services and supply chain activity including (Repair of machinery), (Repair of other Equipment), (Repair of Electrical Equipment), (Installation of industrial machinery and equipment), (Repair of computer and peripheral equipment), (Other engineering activities), (Engineering related scientific and technical consulting activities), (Engineering design activities for industrial process and production).

Table 1 about here

In an earlier paper (Nathan and Rosso 2013) we conduct exploratory analysis on both ICT and digital content activities. The latter is substantially harder to delineate in sector terms, not least because most content sectors are rapidly shifting from physical to multi-platform, online and offline outputs (Bakhshi and Mateos-Garcia 2012; 2013) and because many product categories bleed across sector boundaries (see below).

2.2 Measuring ICT production activity

Ascribing activities to sectors is necessarily an imprecise process, particularly when conventional, administrative datasets are used. In the UK there are three principal issues.

The first issue is about data coverage. For firm-level analysis, the main UK administrative source is the Business Structure Database (BSD), which draws on sales tax, employment and company records as well as government business surveys (Office of National Statistics 2010; Office of National Statistics 2012). However, the BSD only includes firms paying UK sales tax and/or those with at least one employee on the payroll. Pooling across sectors, the BSD covers 99% of UK enterprises but for sectors with large numbers of start-ups and small young firms - such as the digital and information economies, or fields such as nanotech - coverage will be significantly poorer. The BSD is also limited in terms of information, only providing variables on age, employment count, industry, turnover and business address. Alternative sources such as Companies House provide much better coverage of economic activity, but contain important limitations of their own (see Section 3).

The second issue is about SIC codes. SICs are designed to represent a firm's principal business activity, but also aggregate information about inputs and clients (Office of National Statistics 2009). As the OECD (2013) has noted, for niche or rapidly-evolving parts of the economy, SICs can be too broad or aggregated to shine much light.
For this reason, firm counts for 'other' or 'not elsewhere classified' SIC cells are often much bigger than for others close by in sector space, even at the most detailed five-digit level. In the 2011 BSD, for example, the second largest ICT cell is 'Other information technology service activities' (62090), which contains 22,444 enterprises (compared to 66,090 in 'Information technology consultancy activities', cell 62020).[3]

[3] In our main dataset, which is based on Companies House, the relevant counts for 2011 are 42,491 and 65,072 quasi-enterprises, respectively. Again, these are the two largest cells.

A third, related issue is that product categories contain far more detail than sector cells, and often cross sector boundaries. In the OECD analysis software publishing, SIC 5820, contains 10 product/service groups; conversely, the products 'data transmissions services' and 'broadband internet services' are present in multiple SIC cells (6110 through 6190). Cross-sector product types are even more prevalent in digital content activities (OECD 2011).

Taken together, these issues mean that mapping the extent and characteristics of firms in the digital economy using conventional sources and industry information alone is challenging - because of the nature of these firms, constraints on conventional data sources and on purely sector-based classifications. Big data sources and analytics have the potential to bring helpful clarity here.

2.3 Big Data

Big Data is a complex concept that needs careful specification. A popular but seemingly circular definition says that big data is datasets too large for conventional analysis (Dumbill 2013). Instead we follow Einav and Levin (2013), who define big datasets as those that a) are available at massive scale, often millions or billions of observations; b) can be accessed in real time, or close to it; c) have high dimensionality, including phenomena previously hard to observe quantitatively, and d) are much less structured than conventional sources, such as administrative data or surveys.

The use of such datasets and associated analytical techniques - web scraping, text mining and statistical learning - is growing in the social sciences (King 2013; Varian 2014). Well-known examples include analysis of internet search data (Askitas and Zimmermann 2009; Ginsberg, Mohebbi et al. 2009; Choi and Varian 2012); proprietary datasets, such as those derived from mobile phone networks (Di Lorenzo, Reades et al. 2012); and material derived from texts, both historic (Dittmar 2011) and contemporary textual information taken from the Web, political speeches, social media or patent abstracts (Gentzkow and Shapiro 2010; Lewis, Newburn et al. 2011; Couture 2013; Fetzer 2014).

Structured administrative datasets also take on big features when linked together or enabled with API functionality, allowing researchers to call the data more or less continuously. In the UK, virtual environments such as the Secure Data Service (SDS) and HMRC DataLab provide researchers with secure/monitored spaces for matching exercises,[4] and a number of government agencies are introducing API functions for data stored online.

In theory, these sources, tools and platforms should help us to develop much stronger measures of the extent and characteristics of digital economy businesses (and other nascent high-value sectors such as clean technology). Our dataset, for example, is built on an API-enabled 100% sample of active companies in the UK which is updated daily, and combines both public (administrative, structured) and proprietary (unstructured, modelled) layers which are matched to the base layer using firm names and other company-level details. The speed, scale and dimensionality should allow us better coverage of businesses, clearer and more detailed delineation of product / sector space, and richer information on business characteristics. In turn, this promises more reliable analysis, which should lead to development of more effective policies.

Conversely, big data approaches may turn out to have significant limitations for academic and policy-focused research. Einav and Levin (2013) discuss two of these: limits on access to proprietary datasets, and the potentially limited relevance of much business data to public policy-focused research questions.
Other issues include coverage (for instance, of companies not present in scraped/mined sources), reliability (when variables are probabilistic rather than directly observed), and overall quality (proprietary datasets may not be validated to the standards of administrative sources, or at all). Our experience highlights many of these pros and cons.

[4] See and (both accessed 1 December 2013).

3. The Growth Intelligence dataset

Our main dataset is company-level information provided by Growth Intelligence (growthintel.com). Growth Intelligence (henceforth Gi) is a London-based firm, founded in 2011, that provides predictive marketing software to private sector clients. The Gi dataset is unusual in the big data field in that it combines structured, administrative data and modelled information derived from unstructured sources. The simplest way to describe the data is in terms of layers. This section provides a summary: more details are available in Appendix 1.

3.1 Companies House layer

The base layer comprises all active companies in the UK, which is taken from the Companies House website and updated daily. Companies House is a government agency that holds records for all UK limited companies, plus overseas companies with a UK branch and some business partnerships. Registered companies are given a unique company registration number (CRN), and are required to file annual tax returns and financial statements, which include details of company directors, registered office address, shares and shareholders, company type and principal business activity (self-assessed by firms using SIC5 codes), as well as a balance sheet and profit/loss account. In some cases companies also file employee data (as part of the accounts, or when registering for small / medium-size status which carries less stringent reporting requirements). Coverage of revenue and employment data in Companies House is limited: around 14% of the sample file revenue data, and 5% employment data. For this reason, descriptive results should be interpreted with some caution.

3.2 Structured data layers

Gi match Companies House data to a series of other structured administrative datasets. In this analysis we focus on two of these. Patents data is taken from the European Patent Office PATSTAT database. Patent titles and abstracts are obtained from the EPO API feed and combined with the raw data.
We also use UK trademarks data, which is taken from the UK Intellectual Property Office (UK IPO) API feed.[5] Gi use these structured datasets in two ways: to provide directly observed information on company activity (for example, patenting), and as an input for building modelled information about companies (for example, text from patent titles as an input to company sector / product classifications).

[5] Patents and trademarks matching is done on the basis of name and address information. We are grateful to the UK IPO for use of a recent patents-companies crosswalk, which we deploy alongside Gi matching.

3.3 Proprietary layers

This part of the Gi dataset is developed through 'data mining' (Rajaraman and Ullman 2011). Gi develop a range of raw text inputs for each company, then use feature extraction to identify key words and phrases ('tokens'), as well as contextual information ('categories'). These are taken from company websites, social media, newsfeeds (such as Bloomberg and Thomson Reuters), blogs and online forums, as well as some structured data sources. Using workhorse text analysis techniques (Salton and Buckley 1988) Gi assign weights to these 'tokens' which indicate their likelihood of identifying meaningful information about the company. Supervised learning approaches (Hastie, Tibshirani et al. 2009) are then used to develop bespoke classifications of companies by sector and product type, a range of predicted company lifecycle 'events' (such as product launches, joint ventures and mergers/acquisitions) and modelled company revenue in a number of size bands. Tokens, categories and weights are used as predictors, alongside observed information from the Companies House and structured data layers. More information is provided in Appendix 1.

The Gi dataset is complex. For this proof of concept paper we use the Companies House 'base layer' plus a selection of Gi's modelled variables (in-house sector and product classification, plus modelled revenue); in addition to these off-the-shelf variables we also use 'raw' web tokens and token categories for exploratory text-mining exercises on parts of our sample.

3.4 Pros and cons of a Big Data approach

The properties of the Gi dataset should allow it to deal with the three measurement challenges outlined in Section 2.
First, compared to administrative data sources like the BSD, the Gi data has greater coverage of economic activity and provides substantially more information (thanks to the matched and modelled layers outlined above). Second, the additional dimensionality in company classification should allow us a more precise delineation of companies providing ICT products and services. Specifically, SIC5 codes provide 806 sectors in which to place companies, but Gi's 145 sector and 39 product groups provide 5510 possible sector-product cells, a more than six-fold increase. Being able to examine products, sectors and token-level information within sector-product cells affords additional detail that administrative sources and SICs cannot provide. Third, because many of Gi's sources are available in real time or close to it, the company can regularly update its data and track switches in company characteristics, such as pivoting from one product type to another.

Conversely, there are some potential limitations in the Gi dataset. First, coverage of online sources is not perfect. Many companies in the UK do not have a website, for example, and not all websites can be successfully scraped due to site content or build. While 'non-scrapability' is likely random, having a website is not. Of course, a large number of companies without websites will be inactive or connected to an active enterprise that is online; we clean these 'untrue' companies out of our estimation sample (see Section 4). For the rest, Gi's modelled variables also draw on a range of online and offline sources, which further helps deal with potential bias. Very few companies have no observed or modelled information at all: these comprise less than 0.1% of the raw data, and are dropped from our sample.

Second, while the company has conducted some validation exercises on its modelled variables (see Appendix 1), Gi's core code is proprietary, which limits our ability to do forensic quality checking. However, we are able to conduct our own checks by comparing estimates derived from Gi's modelled data against those derived from directly observed information. Section 4 gives more details.

4. Building a benchmarking sample

Our raw data covers all active companies in the UK as of August 2012, and comprises 3.07m raw observations, of which 2.88m have postcodes.
From this we need to build a sample that a) corresponds as closely as possible to the underlying set of businesses, and b) allows comparisons between information economy estimates based on SIC codes and those based on modelled big data. Our cleaning steps are as follows.

First, this 'benchmarking' sample can only include observations with both SIC codes and Gi classifications. Because around 21% of companies in the raw data are missing SIC information, it will therefore be smaller than the 'true' number of companies. In some cases, we can crosswalk SIC fields from the FAME dataset to reduce losses. Overall, these steps reduce our sample from 2.88m to 2.85m observations.

Second, we drop all companies that are non-trading, those that are dormant (no significant trading activity in the past 12 months), dissolved companies and those in receivership / administration. We keep active companies in the process of striking off, since a) most still operate and b) some will have failed to file returns but may re-emerge in the market under a different name. These steps reduce our sample to 2.556m companies.[6] We also drop holding companies from the sample, which reduces it to 2.546m observations.

Third, we build routines to identify groups of related companies, and reveal the underlying structure of businesses. Companies are legal entities, not actual firms, so this is a crucial step to avoid multiple counting in the underlying firm structure (for instance, if company A is part of company B, it may include some of B's revenue / employment in its accounts). This step is necessarily fuzzy, as we are creating 'quasi-enterprises'. We do this in two ways, both of which deliver very similar results. Our preferred approach is to group companies on the basis of name (same name), postcode of registered address (same location) and SIC5 code (same detailed industry cell).[7] Within each group thus identified, we keep the unit reporting the highest revenue (as modelled by Growth Intelligence). Note that for the purposes of benchmarking, we are required to do the industry matching on SIC code. This procedure gives us a benchmarking sample of 1.94m quasi-enterprise-level observations.[8]
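The preferred grouping step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual routine: the field names (`name`, `postcode`, `sic5`, `revenue`), the `STATUS_WORDS` list, the simplified name-key rule and the postcode normalisation are all hypothetical choices made for the example.

```python
from collections import defaultdict

# Words denoting company status, treated as non-informative when building
# the name key. This list is an assumption made for illustration.
STATUS_WORDS = {"limited", "ltd", "plc", "company", "llp"}


def name_key(name):
    """Return a short matching key: the first word, or the first two words.

    Simplified rule (assumption): use only the first word when the name has
    a single word or the second word is a status word; otherwise use the
    first two words.
    """
    words = [w.strip(".,").lower() for w in name.split()]
    words = [w for w in words if w]
    if len(words) >= 2 and words[1] not in STATUS_WORDS:
        return " ".join(words[:2])
    return words[0] if words else ""


def build_quasi_enterprises(companies):
    """Group companies by (name key, postcode, SIC5) and keep one per group.

    Within each group, the unit with the highest (modelled) revenue is kept.
    Postcodes are upper-cased and stripped of spaces before matching, which
    is a normalisation assumed here rather than taken from the text.
    """
    groups = defaultdict(list)
    for company in companies:
        postcode = company["postcode"].replace(" ", "").upper()
        key = (name_key(company["name"]), postcode, company["sic5"])
        groups[key].append(company)
    return [max(units, key=lambda c: c["revenue"]) for units in groups.values()]


# Toy records, invented for the example.
sample = [
    {"name": "Acme Software Ltd", "postcode": "EC1V 9BD", "sic5": "62020", "revenue": 500},
    {"name": "Acme Software Holdings Ltd", "postcode": "EC1V9BD", "sic5": "62020", "revenue": 900},
    {"name": "Acme Ltd", "postcode": "EC1V 9BD", "sic5": "62020", "revenue": 100},
    {"name": "Acme Software Ltd", "postcode": "M1 1AA", "sic5": "62020", "revenue": 50},
]
quasi_enterprises = build_quasi_enterprises(sample)
print(len(quasi_enterprises))
```

In the toy data, the first two records share a name key, postcode and SIC5 code, so they collapse into one quasi-enterprise; the other two differ on name key or location and stay separate. Matching on a coarser location key (such as postcode sector) would only require changing how `postcode` is normalised.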
[6] Dropping non-trading companies removes 92,929 observations; dropping dormant companies removes 106,589 observations; dropping all but active and partially active companies removes 318,906 observations. Some companies may be in more than one of these categories, so sub-totals may not sum.

[7] We do not use the full company name. We use the first word if there is only one word in the name, or if the second word is a common term referring to the status of the company (Limited, Ltd, Plc, Company, LLP, in all their forms). We use the first and second words if there are at least two words in the name, or if the third word is again a status term as in the previous case.

[8] We test the sensitivity of this approach by matching on postcode sector (that is, the first 4/5 digits of the postcode) rather than the full postcode. This less restrictive approach would reduce false negatives (related companies that are very closely co-located but not present at exactly the same address), but might increase false positives (similarly-named but non-related companies in the same industry and neighbourhood). Results show that company counts decline in almost the same proportions across all sectors. This is reassuring, as it implies that there is nothing systematic happening in our selection process. Details are available on request.

We also test an alternative approach that exploits corporate shareholder information matched from FAME. The intuition is that if company A owns more than 50% of company B, A is likely to report B's revenue and employment. We drop B from the sample in these cases. This approach gives us a benchmarking sample of 1.823m observations. Headline results from this alternative approach are in line with our main results set out in Section 6.[9]

[9] Specifically, using SIC-based definitions we have 158,810 ICT producer companies (8.17%) compared to 225,800 companies (11.62%) using the sector-product approach. See Table 2 for headline comparisons.

We validate our cleaning steps by constructing a 'true' sample of all quasi-enterprises, this time including all the companies dropped because of missing SIC codes. We then compare this against counts of actual enterprises in a) the 2011 BSD and b) the 2012 UK Business Population Estimates (the most recent available at the time of writing). The BSD contains 2.161m enterprises, but excludes sole traders and many SMEs. Our true sample of quasi-enterprises contains 2.460m observations as of August 2012, so the BSD figure is around 88% of this: acceptable given the differences in time and sample coverage. The BPE is a more helpful benchmark since it combines BSD enterprises with estimates for non-BSD businesses and sole traders (some of whom will be in our sample if they have registered a company). The BPE gives estimates up to January 2012; to make the comparison cleaner we estimate an August 2012 figure. We include companies, partnerships and sole traders with employees, plus 10% of other sole traders as a proxy for single-owner registered companies. This gives a January 2012 baseline of 2.36m enterprises. We project the August figures based on a smoothed trend: this gives a figure of 2.45m businesses, over 99% of our true sample estimate.[10]

[10] The 2.36m figure includes 1.34m companies, 448,000 partnerships, 297,000 'sole proprietorships and partnerships' with employees and 271,000 sole traders without employees. We also conduct sensitivity checks including 1) 5% of sole proprietors without employees (2.253m enterprises) and 2) basing on trends (2.390m enterprises). Full results available on request.

We also test the robustness of our benchmarking sample structure. This is important to explore, as firms registering at Companies House assign themselves a SIC code. Companies doing novel activities not well covered in SICs might systematically select into 'not elsewhere classified' SIC bins rather than their true classification. The set of information economy SICs contains quite a lot of these, which might lead to upwards bias. Conversely, self-assignment might lead to missing SICs for information economy firms, leading to undercounts.

Specifically, we compare across all five-digit SIC bins in Companies House with those in the 2011 BSD. Appendix 2 sets out the analysis. We find that the different population frames of the BSD and Companies House produce some differences in levels and internal structure, reflecting real differences in company and sector characteristics, such as firm age, industry structures and entry barriers. The overall distribution of Companies House and BSD SIC5 bins is well matched. Around the extremes, we find a number of 'not elsewhere classified' type bins where Companies House counts are higher than the BSD. These bins account for just over 10% of all the data, but only four out of 74 of these bins are in the information economy. Conversely, 21.5% of observations in the Companies House raw data lack SIC codes altogether. Taken together, this suggests that any Companies House processes (such as self-assignment) could be generating a small amount of upwards bias, but this is more than outweighed by the likely downwards bias produced by non-assignment.

5. Identifying ICT production activity

Our benchmarking sample consists of nearly 2m 'quasi-enterprises' classified with both SIC codes (based on company self-assessment), and Gi's sector and product categories (based on a range of observed and modelled information). We now use basic industry-level information economy categories (from SIC codes), and exploit the additional richness and dimensionality in our 'big data' to develop alternative counts of information economy firms.
Our identification job is analogous to studies that seek to map a social/economic phenomenon through analysis of structured and unstructured information, both in data mining and in related fields such as bibliometrics. While these studies have important differences, they share many of the same basic steps. Each begins with a given vocabulary or item set K_X describing the phenomenon X, which is used to analyse a much larger item set, U_X, for which information about X is unknown. Items in K_X may map directly onto U_X, or common features (such as distinctive terms in both K_X and U_X) may be used to generate a mapping.

For instance, Porter et al (2008) deploy bibliometric analysis of academic publications to identify the contours of nanotech research. Specifically, they construct a 'core' set of nanotech publications that is then verified by experts, and use keywords for these publications to build a two-stage Boolean search algorithm that can be run on databases of academic papers or patents. Gentzkow and Shapiro (2010) use speeches by members of the US Congress to analyse ideological 'slant' in the American media: they develop a core vocabulary of liberal and conservative politicians' most distinctive phrases, which is then mapped onto a similar vocabulary of newspaper op-ed pieces in order to estimate media affiliation. Working with patents data, Fetzer (2014) uses existing technology field codes to delineate broad spaces for 'clean' technology, then generates finer-grained technology vocabularies from patent titles and abstracts. These are then used to resample the patents data to provide an alternative mapping of the clean technology space.

Ideally, then, we would look for a rich word- or phrase-level objective vocabulary for information economy companies, K_IE, which we would then map onto a corpus of texts for companies in our benchmarking sample. In practice, we have a category-level starting definition of the information economy from the UN/OECD definition and their UK variants (see Section 2). However, in our data this is only available for industry sectors, and with some disagreement among policy actors about field boundaries. And rather than raw words and phrases, we are working with larger, off-the-shelf sector and product categories (see Section 3). We therefore use this 'categorical vocabulary' as a starting point for our analysis.
We are also able to compare estimates for the information economy done with conventional industry codes (based on company self-assessment), to those done with Gi's sector and product categories (based on a range of observed and modelled information).

5.1 Mapping strategy

Our basic mapping steps are as follows. First, we take the sub-sample of companies with OECD/BIS ICT products and services SIC codes, as defined in Table 1. Next, we extract the corresponding Gi sector and product classifications for those companies: this provides a longlist of 99 Gi sectors and 33 Gi product groups. We treat this as a rough cut of the true set of ICT sectors and products/services.

Following this, we refine the cut. We first use a crude threshold rule to exclude 'sparse' Gi sector and product cells, which might be marginal and/or irrelevant to ICT sector/product space. Sparse groups are defined as those present in less than 0.2% of the long-listed observations. Removing the long tail of sparse cells results in a shortlist of 16 sectors and 12 product groups, which account for the majority of ICT-relevant observations.

Next, we review the sparse Gi sector and product lists in detail to recover any marginal but relevant cells. By construction, each of these cells comprises less than 0.2% of the long-listed observations. 11 The review is rule-based: specifically, we look for sparse Gi sector or product cells where the sector or product name corresponds to 1) the OECD definition of ICT products and services, or 2) BIS modifications to this list. We use the detailed OECD guidance (OECD 2011) and Gi metadata to guide marginal decisions: we include cells that have some correspondence to the OECD-specified SIC4 or CPC group, and exclude those where no such correspondence exists. For example, we recover the Gi sector cells 'computer network security' and 'e-learning', which feature in the OECD product list, but exclude the product cell 'hardware tools machinery', which Gi use to designate construction tools (such as mechanical hoists).

Finally, we use this set of sectors and products to resample sector-by-product cells from the whole benchmarking sample. This creates a set of companies in 'ICT' sectors whose principal product/service is also ICT-relevant.
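The mapping steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code used for the paper: the record fields (`sic`, `sector`, `product`), the function name and the toy data are all hypothetical, and the manual review step is reduced to explicit lists of cells to recover or exclude.

```python
from collections import Counter

def sector_product_sample(companies, ict_sics, threshold=0.002,
                          recovered_sectors=(), recovered_products=(),
                          excluded_products=()):
    """Return companies whose sector AND product cells are both ICT-relevant.

    `companies` is a list of dicts with hypothetical keys 'sic', 'sector', 'product'.
    """
    # Step 1: seed sub-sample of companies with ICT SIC codes.
    seed = [c for c in companies if c["sic"] in ict_sics]
    n = len(seed)
    if n == 0:
        return []
    # Step 2: longlist of sector and product cells observed in the seed.
    sector_counts = Counter(c["sector"] for c in seed)
    product_counts = Counter(c["product"] for c in seed)
    # Step 3: drop 'sparse' cells (share of long-listed observations below threshold).
    sectors = {s for s, k in sector_counts.items() if k / n >= threshold}
    products = {p for p, k in product_counts.items() if k / n >= threshold}
    # Step 4: rule-based review, represented here as explicit recover/exclude lists.
    sectors |= set(recovered_sectors)
    products |= set(recovered_products)
    products -= set(excluded_products)
    # Step 5: resample sector-by-product cells from the whole sample.
    return [c for c in companies
            if c["sector"] in sectors and c["product"] in products]

# Toy data (hypothetical); a high threshold is used only because the sample is tiny.
companies = [
    {"id": 1, "sic": "62010", "sector": "information technology", "product": "consultancy"},
    {"id": 2, "sic": "62010", "sector": "information technology", "product": "software development"},
    {"id": 3, "sic": "62010", "sector": "information technology", "product": "consultancy"},
    {"id": 4, "sic": "62010", "sector": "telecommunications", "product": "hardware tools machinery"},
    {"id": 5, "sic": "71129", "sector": "information technology", "product": "consultancy"},
    {"id": 6, "sic": "47410", "sector": "retail", "product": "consultancy"},
]
baseline = sector_product_sample(companies, {"62010"}, threshold=0.5)
reviewed = sector_product_sample(companies, {"62010"}, threshold=0.5,
                                 recovered_products=("software development",))
```

In the toy data, company 5 has a non-ICT SIC code but sits in an ICT sector-product cell, so it is picked up by the resampling step; company 6 shares the product but not the sector, so it is not.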
5.2 Identification

This 'sector-product' approach, built on a range of data sources, should provide a better mapping of information economy firms than using self-ascribed industry codes alone. Specifically, it should allow us to deal with false negatives in our data (via incorrect SIC coding). It should also tackle false positives, by allowing us to identify the set of companies in 'ICT' sector contexts whose main outputs (products and services) are also ICT-related, disregarding those who are not involved in digital activity. This allows us to keep those companies in (say) the mobile telecoms industry who are actually making mobile phones, and exclude those who are involved in wholesale, retail or repairs.

11 We include the following sectors: e-learning, computer network security, information services, semiconductors. We include the following products: software web application and software mobile application, but we exclude: hardware tools machinery.

To make this analysis robust we also need to deal with some potential problems. First, our starting categories are not completely fixed; as outlined in Section 2, there is some disagreement about which SIC codes should be used to delineate the information economy (recall that while some industry analysts want a very small set of SICs covering production, others believe that a wider set of ICT supply chain industries should be included). This means that the sector-product results (specifically, the set of false negatives) might be endogenous to the set of starting industry cells, rather than being driven by real differences in sector-product information. To deal with this, we reproduce the analysis with different SIC starting sets, both a very narrow set of ICT service industries and a broader set of manufacturing, service and supply chain industry bins.

Second, we might worry that our 0.2% threshold rule still identifies some irrelevant sector/product space (leading to false positives). We therefore experiment with tighter thresholds at 0.3% and 0.5% of long-listed observations.

Third, we might worry that the sector-product approach may collapse to a 'sector' or 'product' analysis, if one of the Gi vectors turns out to be uninformative. In this case false positives could be included in the final estimates. We test this by reproducing the analysis with Gi sector cells alone, and Gi product cells alone.

A final worry is that our off-the-shelf Gi categories are too high-level to always provide useable information. Note that this objection also applies to SIC codes, as we discuss in Section 2. In our case, we are relying on the combination of sector-by-product information to provide extra dimensionality across the pooled sample, but analysis using only Gi sector or product typologies, or individual sector/product cells, may be less informative. We therefore use raw token information from company websites to look inside the largest sector and product cells, providing additional descriptives.
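The threshold check described above can be illustrated with a short Python sketch. It assumes a flat list of Gi cell labels, one per long-listed company; the function name and the toy labels are hypothetical, and the sketch only shows how the set of retained cells shrinks as the sparsity threshold tightens from 0.2% to 0.3% and 0.5%.

```python
from collections import Counter

def surviving_cells(labels, thresholds=(0.002, 0.003, 0.005)):
    """For each threshold, list the cells whose share of observations meets it."""
    n = len(labels)
    counts = Counter(labels)
    if n == 0:
        return {t: [] for t in thresholds}
    return {
        t: sorted(cell for cell, k in counts.items() if k / n >= t)
        for t in thresholds
    }

# Toy longlist of 1,000 observations (hypothetical labels and counts).
labels = ["information technology"] * 995 + ["computer games"] * 4 + ["e-learning"] * 1
cells_by_threshold = surviving_cells(labels)
```

Here 'computer games' (0.4% of observations) survives the 0.2% and 0.3% rules but not the 0.5% rule, while 'e-learning' (0.1%) is sparse under all three and could only re-enter through the rule-based review.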

6. Results

6.1 Headline counts and shares

How do conventional and big data-based estimates of ICT production differ? Table 2, below, gives headline results.

Table 2 about here

Panels A and B give sector-based and sector-product based estimates of information economy companies, based on SICs in Table 1 and Gi sector-product cells respectively. Sector coding identifies 158,810 ICT quasi-enterprises, 8.17% of our benchmarking sample. By contrast, the sector-product approach identifies 225,800 quasi-enterprises, around 11.62% of the economy. That is, our big data-driven estimates in Panel B are 42% higher than the SIC-based estimates in Panel A. Overall, this difference in headline numbers (around 70,000 'missing' companies that are not in the SIC-based estimates but are in the Gi-based estimates) suggests the precision gain is non-trivial. By construction, our sample includes only those companies with SIC and Gi coding, so missing SIC codes are not driving the results. This also implies that the true number of information economy firms is likely to be higher than the counts here.

Other panels report robustness checks that explore some of the identification challenges discussed in section 5.2. Panels C and D show sector-based estimates when changing the starting set of SIC sectors. As we discuss in section 1, stakeholders disagree over the 'real' scope of the information economy, with some favouring broader or narrower definitions than BIS have chosen. Therefore in Panel C1 we look only at SICs covering ICT services, while in Panel D1 we use a broader definition of the information economy that also includes SIC codes in the wider ICT value chain. 12 Panels C2 and D2 give corresponding Gi-based estimates. If our main results were entirely driven by choice of the SIC starting categories, we would find alternative SIC (sector-based) counts converging to the Gi (sector-product) estimates in Panel B. Even with the broadest starting set of SICs (Panel D1) we find 31,624 fewer companies than our baseline Gi estimates (Panel B) and 40,058 more companies in the corresponding Gi counts (Panel D2). While this highlights the importance of how the set of information economy businesses is initially defined, our main results survive, albeit with a smaller set of missing companies unearthed.

12 Panel C covers ICT services only (see Table 1). Panel D includes all the SICs in Table 1 plus (Repair of machinery), (Repair of other Equipment), (Repair of Electrical Equipment), (Installation of industrial machinery and equipment), (Repair of computer and peripheral equipment), (Other engineering activities), (Engineering related scientific and technical consulting activities), (Engineering design activities for industrial process and production).

Panel E tests the effectiveness of the sector-product approach as opposed to using sector-only information. We would expect the lack of granularity to produce higher estimates, which it does (305,177 versus 225,800 companies, almost 16% of the sample). (Using only the product dimension the share would be driven up to more than 50%.) 13

The last two panels show estimates using more conservative threshold rules to exclude sparse Gi sector and product cells: 0.3% and 0.5% in Panels F and G, respectively. Again, we would worry if the resulting counts approached the initial sector-based estimates in Panel A (indicating that the sector-product approach delivers little precision over SIC sectors). Information economy counts and shares drop as expected, but even in the most conservative specification (Panel G) we find 34,597 additional companies using sector-product cells compared to SIC sector codes.

6.2 What kind of additional companies?

Our sector-product method gives us a large number of companies that we would not treat as ICT producers using SIC codes alone. To illustrate the difference, Table 3 maps these quasi-enterprises back onto their SIC codes, for the 18 largest SIC cells.
14

Table 3 about here

Note that some of these SIC bins (specifically, and 95110, 4.8% of the total) would be included in our broad-based set of information economy SIC codes, as discussed in the previous section. Another 8% (33190, 43210, 46250, 47410) also fit into value chain space. However, more than 26% of the omitted companies classify themselves in the 'Other engineering activities', 'Engineering related scientific and technical consulting activities' and 'Engineering design activities for industrial process and production' bins (respectively SIC codes 71129, 71122, 71121); and another 20% define themselves in the advertising agency or specialised design sectors (such as or 74110). In other words, while these companies are in non-ICT sector contexts, their principal products and services put them into the information economy.

13 Results available on request.
14 We conduct the same exercise mapping back to SIC using different ICT SIC definitions. Results are available on request.

6.3 Internal structure of the ICT producer space

Next, we take a closer look at the internal structure of our Gi-based ICT producer estimates. Tables 4 and 5 provide headline counts, shares and revenue information for the largest sector-product cells. Each table rotates the cells to indicate sector information (Table 4) and product information (Table 5), so that companies in (say) the computer games sector could have any of the principal outputs listed in the products table, and companies whose principal product is (say) consultancy might be in any of the sector cells in the sector table. In principle, then, we could construct a very large matrix of all 378 sector*product combinations.

Table 4 about here

More than 46% of companies in Table 4 are located in information technology, almost 15% in computer-related sector groups (computer software, hardware, games), around 20% in engineering and manufacturing sectors, and a further 7% in telecommunications.

Table 5 about here

Table 5 shifts the focus to products and services. Most of the companies are providing some kind of consultancy service (67%), offering software development (8.8%), care and maintenance (7%), web hosting (just under 3%) or some sort of broadband or software related services.

Finally, we use text mining on website information for a sub-sample of companies to uncover more information about the largest cells, information technology and consultancy. 15 As set out in Section 3, Growth Intelligence scrapes website text and uses machine learning to uncover key words and phrases ('raw tokens'), and contextual information for each token ('token categories'). Gi reports 12 token categories, of which we use the four (organization, product, technical term and technology) most likely to describe the nature of the company, the technology used and the type of product. 16 Tokens in these categories are assigned a value representing the relevance of the token for the company, ranging from 0 to 1. We include only tokens whose company relevance is above 0.2. This raw token information needs to be cleaned: we harmonize the words that appear in the tokens by putting all the words into lower case, removing punctuation, and removing words that may refer to the legal status of the company: ltd, plc, llp, company. We also remove some English stopwords following an existing vocabulary. 17 In Figure 1, we report, in a word cloud, the most popular words across the whole set of information economy firms when the sector is defined using the Growth Intelligence classification as per Panel B in Table 2. For reasons of space, we only show the words that appear at least 2,000 times in the whole sample of the information economy (26,408 companies). 18 We end up with a list of 363 words, where the total number of words is 1,839,014. The larger and darker the word is, the more frequently it appears in the sample of companies in the information economy that report token information.
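The cleaning steps just described can be sketched as follows. This is a minimal illustration, not the pipeline used for the paper: the token record layout (`text`, `category`, `relevance`), the function names and the tiny stopword list are assumptions, and underscores are deliberately kept (also an assumption) because the reported vocabulary contains compound words such as 'technology_internet'.

```python
import string
from collections import Counter

# Legal-status words and token categories are taken from the text above.
LEGAL_STATUS = {"ltd", "plc", "llp", "company"}
KEPT_CATEGORIES = {"organization", "product", "technical term", "technology"}
# Placeholder stopword list; the paper follows an existing vocabulary instead.
STOPWORDS = {"the", "and", "of", "for", "a", "in"}
# Strip all ASCII punctuation except underscores (assumption, see lead-in).
PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("_", ""))

def clean_token(text):
    """Lower-case, strip punctuation, drop legal-status words and stopwords."""
    words = text.lower().translate(PUNCT_TABLE).split()
    return [w for w in words if w not in LEGAL_STATUS and w not in STOPWORDS]

def word_counts(tokens, min_relevance=0.2):
    """Count cleaned words over tokens in the kept categories above the relevance cut-off."""
    counts = Counter()
    for tok in tokens:
        if tok["category"].lower() in KEPT_CATEGORIES and tok["relevance"] > min_relevance:
            counts.update(clean_token(tok["text"]))
    return counts

# Toy tokens (hypothetical).
tokens = [
    {"text": "Cloud Software, Ltd.", "category": "Product", "relevance": 0.9},
    {"text": "The Technology", "category": "Technology", "relevance": 0.5},
    {"text": "London", "category": "Location", "relevance": 0.9},
    {"text": "software", "category": "Technical Term", "relevance": 0.1},
]
counts = word_counts(tokens)
```

A frequency cut-off such as the 2,000-appearance rule used for the word cloud would then be a simple filter over `counts`.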
For example, the most frequent word is 'technology', which appears 70,139 times (4% of the total number of words) in the sample; the word 'technology_internet' is also very frequent and appears 40,286 times (2%).

15 We have run statistical tests to check how the sample of companies with tokens differs from the whole sample of companies (the benchmarking sample), both in terms of sectoral distribution (share of ICT companies) and in terms of company characteristics. When defined using SIC codes, the information economy is around 8% of the token sample (similar to the whole sample). When defined using the Gi definition, the information economy is slightly overrepresented in the token sample; this is likely because Gi's algorithms put more weight on the presence of web tokens when assigning a company to a sector. Sectors/products where token information is better (in particular, ICT sectors are likely to have better internet coverage) are likely to be larger. In terms of characteristics, ICT companies in the token sample are likely to be older and to have higher revenues. All the differences are statistically significant.
16 The full list of token categories is: Company, Contact Details, Entertainment Event, Location, Operating System, Organization, Person, Position, Product, Technical Term, Technology, TV Show.
17 accessed 15 December
18 This threshold can be modified to higher or lower frequency.

Figure 1 about here

In Table 6 we report a list of the most popular words (48% of the total number of words) in the information economy, with the total number of 'appearances' and the relative share given by the number of appearances over the total number of words (1,839,014) (Panel A). We also show the same information for the companies in the sector-product cell 'information technology-consultancy' (Panel B), the product cell 'consultancy' across all ICT sectors (Panel C) and the sector cell 'information technology' across all ICT products (Panel D). 19

Table 6 about here

Results show that the word that appears the most across Panels A, B and C is 'technology', while for the IT sector alone it is 'software'. The former represents 4% of the total number of words in the complete ICT producer space (Panel A), 7% in the 'IT-consultancy' sector-product cell, 5% in Panel C (consultancy products) and 6% in Panel D (IT sector), while 'software' in IT appears in 7% of cases. Even more interesting is that the distribution across panels within these information economy cells is very similar: despite the distributions being relatively sparse, with some words appearing only 1% of the time, we observe a high density in the same words across all four panels.

To understand how distinctive these words are, we also look at the word distribution in the rest of the economy: we might worry, for example, that these are simply terms which appear on any company's website. Interestingly, we find that the most relevant words are not the same, and that the words that are denser in ICT production space are under-represented in other activity spaces (Figure 2).
Figure 2 about here

19 In the subsample of companies with tokens we have 3,716 companies doing IT and consultancy, 12,556 companies providing some consultancy service in any ICT sectors, and 4,296 in the information technology sector (any ICT products).

7. Characteristics of ICT and non-ICT businesses

This section provides more detailed information on companies' age, inflows, revenues and employment. Not all companies report revenue or employment data, so these latter analyses are done on suitable sub-samples. While some companies have no revenue or employees to report, there are also some holes in the Companies House data. 20 We perform a range of diagnostic checks to make sure the sub-samples are representative, but data limitations mean that revenue and employment information has to be interpreted with some care.

7.1 Age

Table 7 reports the average age of ICT and non-ICT companies in the benchmarking sample. 21 Using SIC codes, ICT companies are almost three years younger than non-ICT firms; using sector-product definitions the difference shrinks slightly. Notably, median differences between ICT and non-ICT firms are substantially smaller; the median ICT firm is now about a year younger than its non-ICT counterpart, whichever definition is used.

Table 7 about here

In Table 8, we show the distribution of companies by age groups. This share cannot easily be interpreted as a survival rate, as nothing is revealed about the actual turnover rate of companies. 22 Panel A uses SIC code definitions; Panel B uses sector-product groups. In Panel B, around 66% of 'ICT' companies are under 10 years old, 33% under five years, 14.4% under three years old and around 1% less than a year old. This compares with 64.6%, 30.6%, 13.8% and 2.2% respectively in the rest of the economy. Analysing the distribution using SIC codes (Panel A) shows very similar patterns. Start-ups, usually defined as companies less than three years old, are slightly more common among ICT producers than in the rest of the economy.

20 Some companies will not file annual returns or accounts on time; others may file incomplete information; others may fail to declare revenue. Companies House may have limited resources to chase up offenders.
21 We report estimates only for our preferred definition, panels A and B of Table
22 We have looked at companies that dissolved in year 2012, which have dropped from the selected sample. We have looked at the distribution of companies by incorporation year and by sector; in this case too, the distribution over time is similar in the ICT sectors and in the rest of the economy. This also implies that the average age is similar, and it is actually higher for the digital economy sectors when using the Gi definition.

Table 8 about here

On the face of it, these findings are surprising. The popular image of the ICT industry is of start-ups and very young companies. Our evidence, however, suggests that there is no reason to think that ICT companies are more ephemeral than other companies. Our analysis of inflows, below, tells a similar story.

7.2 Inflows

Figure 3 shows the inflow of companies into the economy, comparing inflows of companies into ICT production (dashed line) with companies in the rest of the economy (solid line), from 1980 to The number of ICT companies entering the economy every year has always been much smaller, but it is interesting to see that when using Growth Intelligence's classification we are able to capture a higher level of inflow over the whole period considered but in particular after the year

Figure 3 about here

We also estimate the growth rate, defined as the percentage of the yearly inflow over the total existing companies, and compare it across the two sectors. Results are shown in Figure 4.

Figure 4 about here

Two things are worth noting. First of all, the growth rate of ICT companies was higher than the rate in the rest of the economy in the period before the dot-com bubble of 2000, and this is even more evident when using the SIC codes. The reason why the rate is smoother in the Gi-based classification may be related to the fact that when using our alternative definition we are also capturing companies that have been in the economy for a longer period and started to produce products or provide services that we would include in the ICT definition.

23 Company reclassification may be more pronounced over longer periods: this will not be captured in SIC codes, which in Companies House are ascribed when companies are set up. Growth Intelligence's more up to date information may be buying us extra precision here.

7.3 Revenue

As discussed in Section 3 and in the Appendix, regular Companies House data provides relatively limited information on company revenues. Only 13.9% of the companies in our sample reported revenues in the period between 2010 and 2012, and an even smaller percentage (8.4%) filed revenues every year over the same period. We therefore supplement this information with Gi's modelled revenue data, which covers all of the companies in the dataset.

Table 9 about here

Table 9 sets out these two sources together. We can see from Panel A that the sub-sample of companies reporting revenues is similar to the full sample in terms of information economy shares. For this sub-sample, non-ICT companies have higher average and median revenues, but on Growth Intelligence's measures the gaps between the two groups narrow substantially. When shifting to modelled revenue, ICT firms have lower average revenue but rather higher median revenue than non-ICT firms. In Panel B, we look at revenue growth for companies that report revenues to Companies House over more than one year. The first column reports the average percentage growth, defined as the within-firm growth of revenues averaged over the sample. On the sector-product basis, growth is higher for ICT companies (22%) than the rest of the economy (15%), with similar results for SIC-based definitions. Median differences are rather smaller.

Table 10 about here

Table 10 takes a higher-level view of modelled revenue across the whole benchmarking sample. Average revenues for ICT firms run at around 40% of the non-ICT average under the SIC definition, and slightly higher under the sector-product definition. Looking at medians, non-ICT firms have slightly lower modelled revenue than ICT firms using both SIC and sector-product cells. Again, differences in levels between means and medians are substantial, suggesting the presence of outliers.

7.4 Employment

Under Companies House rules, companies are only obliged to report employment data in specific cases: in our raw data, only 100,359 companies provide this information. As with revenue, this will be a selected sub-sample, so we run checks to determine the shape of the bias. 24 We would expect companies with employees to be older and have higher revenues than those without, and this turns out to be the case: those in the employment set are on average twice as old, and report average modelled revenues around 2/3 higher than the non-employment set. These caveats should be borne in mind in what follows. On the other hand, tests of industrial structure suggest very similar shares of ICT and non-ICT companies, and the spatial distribution of the companies across the UK is very similar, with three out of the top five locations being shared.

Table 11 about here

First we look at employees per firm. Table 11 shows average and median employees per company. As not all companies report employment in every year, we smooth the data across three and five-year periods. Average employment counts for ICT businesses differ substantially between SIC and Gi-based definitions. Using SIC codes, non-ICT businesses are somewhat larger than ICT firms, and a little bigger than the average firm. Using sector-product definitions, ICT firms employ rather more people on average than companies in the wider economy and the average firm, especially in the period. However, median differences are much smaller, with non-ICT firms consistently reporting higher worker counts. That suggests outliers explain much of the mean differences.

Table 12 about here

Next, we turn to ICT firms' share of all employment (for which we have information).
Table 12 shows that shifting from SIC-based definitions of information economy businesses to Gi definitions shifts ICT firms' employment share substantially upwards, from around 3.5% to nearly 12% of all jobs in , and from 3.7% to 8.92% in This is as we would expect, since underlying company counts are higher in our big data-driven definitions. 24 Full results are available on request. 28

7.5 Location

To get a sense of how the information economy is distributed across the UK, we geo-code individual companies into Travel to Work Areas (TTWAs). TTWAs are designed to represent functional labour markets, and are generally considered to be the best available approximation of a local economy. 25

Our analysis is using 'quasi-enterprises' rather than individual plants, and using the registered addresses of those companies. This needs to be borne in mind in interpreting the results. First, in most cases the registered address of a company will also be its trading address, but not in all cases. 26 Bundling companies into TTWAs minimises the chances of putting companies in the wrong part of the country. Second, using registered addresses is also likely to lead to a more big-city-centric distribution, since London and large urban cores are more likely to contain company headquarters than TTWAs with smaller cities, or rural areas.

Geo-coding also slightly shrinks our benchmarking sample, from 1.94m to 1.936m companies. This is because not all company addresses provided to Companies House include postcodes, and because some companies provide PO Box addresses (where the postcodes are not assigned to a particular geography).

We first look at the distribution of companies around the country (Figure 5). The left-hand map shows the UK's Travel to Work Areas with banded counts of information economy firms, using the Gi-based sector-product measure. We have divided the counts into quartiles, each of which represents 25% of the observations, plus a separate London band.

Figure 5 about here

25 Formally, at least 75% of those living in a given TTWA also work in that TTWA, and vice versa.
26 Gi have collected experimental trading address postcodes for a sub-sample of 316,884 companies, using postcode data from company websites and phone directories. This data is very noisy and should be treated only as a fuzzy estimate (issues include false positives for common company names and 'false missings' if websites are non-scrapable or not provided). 257,358 companies in our benchmarking sample (13.6%) have trading address data. Of these, 216,349 (84.07%) have only one trading address. Identical or co-located registered and trading addresses for same-named companies are very likely to represent the same company. In 97,629 cases (45.31%) the full trading postcode is the same as the registered address; for 111,183 cases (51.39%) the trading address is in the same 3/4-digit postcode sector as the registered address; for 149,426 companies (69.35%) the trading address is in the same 2-digit postal area as the registered address.

The information economy is very spiky, with a lot of co-location in London, Manchester and the Greater South East. Using our preferred sector-product measure, the 10 TTWAs with the most digital economy companies are London (58,248 companies), Manchester (7,582), Guildford and Aldershot (6,172), Birmingham (5,384), Luton and Watford (4,578), Reading and Bracknell (4,091), Bristol (3,862), Crawley (3,827), Wycombe and Slough (3,483), and Brighton (3,376). Underneath this group are another 40-odd TTWAs with 1,000-3,000 information economy companies, followed by a very long tail: over 60% of the areas on the map have fewer than 500 companies, and 25% have under 100.

Using SIC codes, the top 10 TTWAs are very similar, although counts are smaller: London (43,802 companies), Guildford and Aldershot (4,825), Manchester (4,604), Birmingham (3,617), Luton and Watford (3,592), Reading and Bracknell (3,405), Crawley (2,841), Bristol (2,803), Wycombe and Slough (2,670) and Brighton (2,668). Overall, around 80% of companies are in urban areas (defined as TTWAs with a city of at least 125,000 people), although this share will be higher than in a plant-level analysis, which would look at trading locations as well as registered addresses.

Next, we use location quotients to get a sense of where the information economy is most locally clustered (in the sense of co-location). Location quotients compare the local area share of a group i to its national share.27 Location quotients over 1 indicate local clustering; under 1 suggests dispersion. Results are shown in the right hand panel of Figure 5. Looked at this way, the spatial footprint of the information economy is rather different.
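As a rough illustration of the location quotient defined in footnote 27, LQ = (p_ia / p_a) / (p_i / p), the sketch below computes LQs from area-level firm counts. The function name, the input layout and the example counts are assumptions made for illustration; they are not taken from the paper's data or code.

```python
def location_quotients(counts):
    """Return {area: LQ} for one group of firms.

    counts maps area -> (group_firms, all_firms), i.e. (p_ia, p_a).
    National totals p_i and p are obtained by summing over areas.
    """
    national_group = sum(group for group, _ in counts.values())  # p_i
    national_all = sum(total for _, total in counts.values())    # p
    national_share = national_group / national_all
    return {
        area: (group / total) / national_share
        for area, (group, total) in counts.items()
        if total > 0  # skip areas with no firms at all
    }


# Hypothetical counts, invented for illustration only.
example = {
    "Area A": (300, 1000),
    "Area B": (100, 1000),
    "Area C": (200, 2000),
}
lqs = location_quotients(example)
clustered = sorted(area for area, lq in lqs.items() if lq > 1)
```

In this toy input only "Area A" has a local share above the national share, so it is the only area flagged as clustered. A large city with many information economy firms but a very diverse economy would score close to 1 on the same measure.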
Using our preferred Growth Intelligence-based metrics, the 10 areas with the highest location quotients are Basingstoke (1.84), Reading (1.78), Newbury (1.68), Milton Keynes (1.54), Swindon (1.51), Luton and Watford (1.43), Guildford (1.41), Middlesbrough (1.38), Wycombe and Slough (1.374) and Stevenage (1.372). Just outside this are Brighton (1.35), Coventry (1.34) and Cambridge (1.33).28 Using LQs, then, highlights the importance of the digital economy to cities in the Greater South East, especially in the crescent of high-value activity that runs around the West of London, but also highlights some perhaps unexpected hotspots, both in the North East (Middlesbrough, 1.38 and Hartlepool, 1.21) and in Eastern Scotland (Livingston and Bathgate, 1.20 and Aberdeen, 1.11).

Why don't we find places like London, Manchester and Birmingham in these lists too? Partly because these are large cities with diverse economies. However, we can use more detailed geocoding to look at very local clustering within these cities, particularly for young firms. Evidence suggests that large cities can act as 'nurseries' for start-ups (Duranton and Puga 2001), and we see some confirmation of this in our data. Table 13 shows the 30 postcode sectors with the largest counts of information economy start-ups (defined as companies up to three years old), and the corresponding counts of all start-ups and all information economy firms.29

Table 13 about here

Overall, the distribution of these young firms is highly uneven: over 32% of postcode sectors have no start-ups at all, and 56% have fewer than 10. The remaining areas contain over 93% of all start-ups. Five of the top 10 postcode sectors are in Central London, of which three are in East London (EC1 or EC2). The rest of the top 10 are in Brighton (BN36), Coventry (CV12), High Wycombe (HP11) and Poole (BH121). Figure 6 maps this postcode-sector activity across central London. We can see some of the familiar geography of Tech City (Nathan and Vandore 2014), but also other hotspots around London Bridge and Canary Wharf, as well as parts of the West End. (Remember that our mapping does not include digital content activity, so broader 'digital economy' counts will be rather higher than this.)

Figure 6 about here

27 Formally, $LQ_{ia} = (p_{ia} / p_a) / (p_i / p)$, where $p_{ia} / p_a$ is the local population share of i in area a, and $p_i / p$ is i's national population share. An LQ above 1 indicates concentration, or local shares above the national share; scores below 1 indicate dispersion, or local shares below the national share.

28 Using SIC codes to define the top 10 information economy clusters, we find a broadly similar pattern orientated around the Greater South East.

7.5 Patenting

Information on companies' patenting activity provides a useful insight into IP and ideas generation. This section gives some headline descriptive findings.
Our patents data covers European Patent Office (EPO) applications and is matched onto company data, using name and company/applicant/inventor location information.30 We are interested in patents where the applicant is based in the UK, or where at least one of the inventors involved is UK-based. The overall match rate from patents to companies is 65.4%, which is satisfactory. A number of patents will not match because the applicant is an individual rather than a company; because the applicant name field has errors; or because applicants are not in our benchmarking sample but may be in the wider Companies House data.

The resulting matrix comprises 63,860 'raw' patents filed by 8,869 companies between 1978 and 2012; 108,316 inventors are named, of whom 85,498 are resident in the UK. Patents are organised by 'priority year', that is, the year in which they first entered the EPO or another patent office. A number of patents have more than one applicant, so to avoid double-counting the analysis is done using weighted patents, where patents are weighted by the number of applicants.

Table 14 looks at company-level patent counts, pooled across years. Average counts are very small (Panel A), which is explained by the fact that most UK companies do not patent at all (see Hall et al (2013) for more on this). However, information economy companies tend to patent more than non-information economy businesses, whichever measure is used. In both cases, the differences are statistically significant once weighted patents are used. For the subset of companies with at least one patent (Panel B), information economy companies again patent more than those outside the information economy, but differences are not statistically significant.

Table 14 about here

While information economy companies are higher-patenting, they are in the minority both in the wider data and in the patenting sub-sample. So the majority of patenting is done outside the information economy, as the analysis below will highlight.

29 Postcode sectors comprise the first four / five digits of a postcode, for seven / eight-digit postcodes respectively.

30 See OECD (2009), OECD Patent Statistics Manual, Paris: OECD, for an overview of the use of patent data in economic analysis, and discussion of the EPO and other patent filing systems. At this stage we do not look at IPC filings, restrict to granted patents or weight patents by citations. All of these steps are feasible for future analysis.
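The applicant weighting described above can be sketched as follows. The text says only that patents are weighted by the number of applicants; this sketch assumes standard fractional counting, in which a patent with n applicants contributes 1/n to each of them. The function name, data layout and firm names are hypothetical.

```python
from collections import defaultdict


def patent_counts(patents):
    """Return (raw, weighted) patent counts per company.

    patents is an iterable of applicant lists, one list per patent.
    ASSUMPTION: a patent with n distinct applicants adds 1/n to each.
    """
    raw = defaultdict(int)
    weighted = defaultdict(float)
    for applicants in patents:
        unique = set(applicants)
        if not unique:
            continue  # ignore patents with no recorded applicant
        share = 1.0 / len(unique)
        for company in unique:
            raw[company] += 1
            weighted[company] += share
    return dict(raw), dict(weighted)


# Invented example: four patents, two of them co-applied.
patents = [
    ["Firm X"],
    ["Firm X", "Firm Y"],
    ["Firm X", "Firm Y", "Firm Z"],
    ["Firm Z"],
]
raw, weighted = patent_counts(patents)
```

Under this weighting the weighted counts sum to the number of patents, whereas the raw counts sum to the number of patent-applicant pairs. A firm that co-patents heavily therefore sees its weighted count fall relative to its raw count.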

Next, we look at the distribution of patents across technology fields, using the OST7 classification developed by Schmoch (2008). Table 15 gives the breakdown. Looking across all patents (Panel A), we can see that about 70% of activity is covered by the first four classes (electrical engineering and electronics, instruments, chemicals and pharma/biotech) with mechanical engineering taking the next largest share. By contrast, information economy companies' patenting is heavily orientated towards electrical engineering and electronics, followed by instruments (Panels B and C). The spread across classes is more even when Gi measures are used.

Table 15 about here

Using Gi definitions, we can see that information economy firms undertake more than half of electrical engineering and electronics patenting, and around a fifth of instruments patenting (these fall to 45% and 16% when SIC-based definitions are used). However, note that this analysis is done on unweighted patents, so does not take into account the number of applicants per patent. Weighted distributions will differ depending on the extent of co-patenting across technology fields, and inside / outside the information economy.

Figure 7 about here

We then turn to patenting over time (Figure 7). The top panel shows the overall distribution, weighted by applicants on the patent, plus the information economy trend. We can see a rapid growth in patenting overall, while information economy activity rises much more gently. The bottom panel looks at patenting across OST7 technology fields. We can see very strong growth in electrical engineering and electronics patents, some increase in instruments and in mechanical engineering, then weaker growth in other fields. The rapid growth in electronics and electrical engineering is partly driven by a spike in software patenting, which in turn is partly driven by changes in US IP legislation in the early 1990s (Li and Pai 2010).
Figure 8 about here

Figure 8 focuses on this technology field in more detail, and sets out the information economy share of activity. The top panel shows raw patent activity, the bottom panel activity with patents weighted by applicants. The raw patents analysis shows a very high share of information economy activity in overall patenting, which corresponds with the breakdowns in Table 16. However, once we control for the number of applicants on each patent, the information economy share drops significantly. This suggests a substantial amount of co-patenting by information economy businesses, which does not apply so much to other firms patenting in this technology field.31

Trademarking

As with patents, trademarks (TMs) also provide some indication of firms' intellectual property holdings and the innovative activity underlying this (Mendonça, Pereira et al. 2004). There are also differences: patents typically indicate investments in technical knowledge, while trademarks are more closely associated with marketing strategy (Sandner and Block 2011). Specifically, while patents are granted for ideas developed, trademarks can be granted against future IP, for example a name or slogan that may be used in the future for a product that does not yet exist. As measures of innovation, therefore, TMs are not clear-cut; as broader indicators of strategic IP activity they are very useful.

At this stage in the analysis our trademarks data is a single slice of 14,637 trademarks live in , taken from the UK IPO journal and matched to companies in the benchmarking sample. The overall match rate is 61.5%, for 5,559 companies holding at least one TM. Even taking non-matches into account, this implies that the majority of firms in our benchmarking sample do not use TMs at all, a finding echoed in Hall et al (2013). Firms in the sub-sample are on average 12 years older than those outside, and have significantly higher average revenues.

Trademarks are classified using 46 'NICE' classes,32 and can be listed in multiple classes (although over 80% of our trademarks have three or fewer classes). For simplicity, we organise TMs into four mutually exclusive groups covering manufacturing, food and drink, services, and hybrid (covering at least one of the previous three classes). We also identify technology-oriented trademarks within these groups.33

Table 16 shows trademarking activity for , across the full benchmarking sample (Panel A) and the subset of firms with at least one live TM (Panel B). As noted above, the majority of firms hold no trademarks, so counts in the pooled sample are very low. Counts in the sub-sample are higher, with the average firm holding just over 1.6 trademarks (Panel B). Notably, overall holdings inside the information economy are always significantly smaller than outside, and smaller than the underlying sample average. This contrasts with patenting, where firms in the information economy hold more patents than their non-information economy counterparts (and the average firm).

Table 16 about here

Table 17 provides a summary breakdown of trademark groups. The top panel shows that 'manufacturing' trademarks comprise around 39% of marks, followed by 'crossover' TMs (just under 26%), services (just under 24%) and food and drink (around 11%). Technology-orientated TMs comprise 37.8% of the sample: the breakdown here is rather different, with crossover and services trademarks dominating.

Table 17 about here

Table 18 looks at company trademarking in these technologically orientated NICE classes. In contrast to the whole set of TMs, here we can see significantly higher trademarking by information economy firms both in the full benchmarking sample (Panel A) and in the sub-set of trademark-holding companies (Panel B).

Table 18 about here

31 We test for the presence of patents which have an 'information economy' applicant and at least one non-information economy co-applicant. We find 257 occurrences (0.4% of all patents).

32 See (accessed 13 May 2014).

33 'Manufacturing' covers NICE classes 1-28, 'food and drink' 29-34, 'services' Within these, 'manufacturing tech' covers NICE classes 7 ("machines and machine tools") and 9 ("scientific instruments, audio, video, computers"); 'services tech' covers NICE classes 38 ("Telecommunications"), 41 ("Education, training, entertainment"), 42 ("Scientific and technological services including software"). We find no technologically orientated NICE classes in the food and drink group.
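The grouping in footnote 33 can be illustrated with a small classifier. This is a sketch under stated assumptions, not the authors' procedure: the class range for 'services' is missing from the text, so the code assumes it is every class above 34; it treats a trademark whose classes span more than one group as 'hybrid'; and it flags a trademark as technology-orientated if any of its classes is 7, 9, 38, 41 or 42. All function and variable names are hypothetical.

```python
MANUFACTURING = range(1, 29)    # NICE classes 1-28 (footnote 33)
FOOD_AND_DRINK = range(29, 35)  # NICE classes 29-34 (footnote 33)
TECH_CLASSES = {7, 9, 38, 41, 42}  # 'manufacturing tech' and 'services tech'


def group_of(nice_class):
    """Map one NICE class number to a broad group."""
    if nice_class in MANUFACTURING:
        return "manufacturing"
    if nice_class in FOOD_AND_DRINK:
        return "food and drink"
    # ASSUMPTION: the services range is elided in the source text;
    # here every class above 34 is treated as services.
    return "services"


def classify(classes):
    """Return (group, is_tech) for a trademark listed in `classes`."""
    if not classes:
        raise ValueError("a trademark needs at least one NICE class")
    groups = {group_of(c) for c in classes}
    # ASSUMPTION: marks spanning several groups are 'hybrid'.
    group = groups.pop() if len(groups) == 1 else "hybrid"
    is_tech = any(c in TECH_CLASSES for c in classes)
    return group, is_tech


examples = {
    "single tech goods class": classify([9]),
    "food and drink only": classify([30, 32]),
    "goods plus services": classify([9, 42]),
}
```

Because each trademark receives exactly one group label, the four groups stay mutually exclusive even though a single mark can be listed in several classes.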

The overall information economy share of these trademarks is 23.8% (on the Gi sector-product measure), versus 19.6% when the analysis is done on a sectoral basis. While information economy firms hold more of these trademarks than their non-information economy counterparts, they are a minority of firms in the overall benchmarking sample.

8. Discussion

Governments around the world want to develop their ICT and digital industries. To do this effectively, policymakers need a clear sense of the size and characteristics of these businesses, which, as we have shown, is hard to obtain with conventional datasets and definitions. This paper uses innovative big data resources to perform an alternative analysis, focusing on ICT-producing firms in the UK ('information economy' businesses). Exploiting a combination of public, observed and modelled variables, we develop a novel sector-product mapping approach and use text mining to provide further detail on the activities of key sector-product cells. We argue that this provides greater precision and richness than relying on SIC codes and conventional datasets.

Overall, we find that the ICT production space is around 42% larger than SIC-based estimates, with at least 70,000 more companies. We also find employment shares over double the conventional estimates, although this result is more speculative. The largest sector-product cells are in information technology (sectors) and consultancy (products); text analysis suggests software, Internet tools, system management and business / finance are particular strengths of companies in these cells. More broadly, ICT hardware, games, ICT-related engineering/manufacturing, telecoms, care and maintenance are key activities across the UK's ICT production activity space. ICT firms are slightly younger than non-ICT firms, with a slightly higher share of start-ups; while their average revenues are lower, on some measures revenue growth for ICT firms is higher than for their non-ICT counterparts.
Defined on a sector-product basis, ICT firms employ more people on average than non-ICT firms (although median differences are much smaller). Patent and technologically-orientated trademark holdings are higher for information economy businesses than for non-information economy firms, although the differences are not always statistically significant. Information economy businesses are highly clustered across the country, with very high counts in the Greater South East, notably London (especially central and east London), as well as big cities such as Manchester, Birmingham and Bristol. Looking at local clusters, we find hotspots in Middlesbrough, Aberdeen, Brighton, Cambridge and Coventry, among others.

We thus find a set of companies that is larger, more established and perhaps more resilient than popular perceptions suggest. These results derive from the many affordances of our dataset, and from the careful cleaning and identification procedures we have employed. Some care has to be taken with the revenue and employment results, since these derive from non-random subsamples, but Gi is able to provide some workarounds for these (such as modelled revenue).

Our experiences so far with the Growth Intelligence dataset also provide us with some valuable lessons on the pros and cons of using frontier data for innovation research. The Gi dataset has excellent reach and granularity and, as we have shown, provides significant extra information on fast-changing parts of the economy. We also highlight some challenges. Like other commercial products such as FAME, which we also use here, the Gi dataset is not free to academic researchers and there is no automatic right to access. Similarly, Gi's proprietary layers are based on non-public code, so while validation is possible it is limited by the relative lack of metadata. This may limit wider replicability of the results by other teams and in other country contexts. These constraints are not unique to big data, however.

Other issues derive directly from the use of core big data tools and analytics. Web and news-based information on companies is extremely rich but is not always comprehensive, and needs to be supplemented from other sources.
Data providers may throttle information drawn from APIs, which places some constraints on speed of draw-down and thus the real-time character of some unstructured sources. The use of learning routines to generate probabilistic variables is ideal for exploring aggregate patterns in very large datasets, but can become noisy when researchers wish to look at smaller blocs of the data.

Taken together, these suggest a number of broader issues for researchers and policymakers. First, researchers should carefully consider the advantages and limitations of off-the-shelf big datasets, and consider developing their own bespoke information as a complement. Second, government and universities need to develop researcher capacity to generate, as well as analyse, unstructured and other frontier data resources. Third, there is a clear need for secure sharing environments where proprietary and public data can be pooled, explored and validated. In the UK, the Secure Data Service provides one potential model for such a platform. Finally, and linked to this, there is a need for structured partnership projects to incentivise researchers and data providers to work together.

The Gi dataset suggests various avenues for future research. One is exploring co-location and clusters in more detail. Another is to use modelled events as predictors of future observed behaviour. A third is to look at determinants of growth or lifecycle events. In the last two cases, the analysis would need to be done for the sub-sample of companies that can be panellised in the data, and would benefit from merging with administrative datasets. More broadly, this company-level data could be combined with worker-level information to explore how ICTs are changing patterns of labour use and workforce organisation.

Word            Panel A            Panel B            Panel C            Panel D
                count     share    count     share    count     share    count     share
technologies    14,002    1%       3,627     2%       8,418     1%       4,157     2%
digital         13,656    1%       1,274     1%       5,877     1%       1,618     1%
telephone       13,574    1%       0         0%       6,135     1%       1,210     0%
information     13,263    1%       3,957     2%       8,748     1%       4,552     2%
Total           884,371   48%      146,776   74%      463,318   57%      174,358   70%

Source: Gi data
Note: Word appearance (count) refers to the number of times the word appears in the sample of companies reporting tokens. Relative share is computed as the number of appearances over the total number of words in the sample. Panel A reports words in the tokens of all companies in the information economy, defined to include both manufacturing and service sectors. Panel B reports the words in the tokens of companies in the IT (sector) and consultancy (products) cell. Panel C reports companies doing consultancy. Panel D reports companies in the IT sector.
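The two quantities defined in the note, word appearances and relative share, can be computed along the following lines. This is a minimal sketch: the token lists are invented, lower-casing is an added assumption, and the function name is hypothetical rather than part of the Gi pipeline.

```python
from collections import Counter


def word_shares(token_lists):
    """Return {word: (appearances, relative_share)}.

    token_lists holds one list of tokens per company. Relative share is
    appearances divided by the total number of words in the sample.
    """
    counts = Counter(
        word.lower() for tokens in token_lists for word in tokens
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: (n, n / total) for word, n in counts.items()}


# Invented tokens for three hypothetical companies.
sample = [
    ["software", "digital"],
    ["Digital", "consultancy", "software"],
    ["digital"],
]
shares = word_shares(sample)
top_word = max(shares, key=lambda w: shares[w][0])
```

Running the same function on the tokens of a sub-sample (for example, only companies in one sector-product cell) would give the panel-specific counts and shares.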

Table 7. Age of companies, mean and median years of activity.
Columns: Other (mean, median); Information Economy (mean, median)
Rows: SIC 07 - manufacturing and services; GI sector and product
Source: Gi and Companies House data
Note: Age defined as years of activity since the company was incorporated.

Table 8. Distribution of companies by age groups, %.
Columns: Other; Information Economy
Panel A. SIC 07 - manufacturing and services: up to 1 year old; up to 3 years; up to 5 years; up to 10 years
Panel B. GI sector and product: up to 1 year old; up to 3 years; up to 5 years; up to 10 years
Source: Gi and Companies House data
Note: Each entry represents the share of companies within each age group.
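As a sketch of how the age-group shares in Table 8 could be derived from company ages, the function below computes the share of companies at or below each age threshold. It assumes the bands are cumulative ('up to' each threshold), which the labels suggest but the source does not confirm; the ages and names are invented.

```python
def age_group_shares(ages, thresholds=(1, 3, 5, 10)):
    """Return {threshold: share of companies aged <= threshold}.

    ages are years of activity since incorporation.
    ASSUMPTION: the 'up to N years' bands are cumulative.
    """
    n = len(ages)
    if n == 0:
        return {}
    return {t: sum(age <= t for age in ages) / n for t in thresholds}


# Invented ages for five hypothetical companies.
ages = [0, 2, 4, 8, 20]
age_shares = age_group_shares(ages)
```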

Table 9. Mean and median revenues and revenue growth from Companies House
Panel A. Average revenues: Companies House (mean, median); Growth Intel (mean, median); Obs; sector distribution
Panel B. Average annual revenue growth: Companies House (mean, median); Obs; sector distribution

SIC 07 - manufacturing and services
Other: 21,640, ,281 25,780,253 70, , ,
Information Economy: 11,658,404 97,669 13,142,859 83,073 17, ,

GI sector and product
Other: 21,605, ,241 25,864,831 68, , ,
Information Economy: 15,130, ,640 16,311,935 91,240 25, ,

Source: Gi and Companies House data
Note: Companies House average revenues are averaged over the period 2010 to . Growth Intel revenues are computed over the same sample. For the Companies House dataset, if there is more than one observation for a company, only the most recent is kept. Average annual revenue growth is computed on a smaller sample, as information for at least two consecutive years is needed. The years considered are the same as above, 2010 to
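The requirement for at least two consecutive years of data can be illustrated with a small growth calculation. The sketch assumes that average annual growth is the simple mean of year-on-year growth rates over consecutive-year pairs; the note does not specify the exact formula, and the function name and figures are hypothetical.

```python
def average_annual_growth(revenues_by_year):
    """Mean year-on-year revenue growth for one company, or None.

    revenues_by_year maps year -> revenue. Only pairs of consecutive
    years with a positive base-year revenue contribute.
    ASSUMPTION: a simple mean of annual growth rates.
    """
    rates = []
    for year in sorted(revenues_by_year):
        previous = revenues_by_year.get(year - 1)
        if previous is not None and previous > 0:
            rates.append(revenues_by_year[year] / previous - 1.0)
    return sum(rates) / len(rates) if rates else None


# Invented revenue histories.
growth = average_annual_growth({2010: 100.0, 2011: 110.0, 2012: 121.0})
no_consecutive = average_annual_growth({2010: 100.0, 2012: 150.0})
```

A company with revenue data only for non-consecutive years returns None and so drops out, which is why the growth figures rest on a smaller sample than the revenue levels.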

Figure 3. Inflow of companies between 1991 and 2011
Panels: SIC codes; GI sectors & products. Horizontal axis: incorporation year. Vertical axis: flow (thousands). Series: Other; Information Economy.
Source: Gi and Companies House data
Note: The graphs show the inflow of active companies in each year.
DRAFT: NOT FOR CIRCULATION OR QUOTATION WITHOUT PERMISSION

Figure 4. Growth rate in the number of firms between 1980 and 2011
Panels: SIC codes; GI sectors & products. Horizontal axis: incorporation year. Vertical axis: %. Series: Other; Information Economy.
Source: Gi and Companies House data
Note: Growth rate is the number of firms entering the economy each year as a percentage of total existing firms.
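The growth rate described in the note to Figure 4 can be sketched from incorporation years alone. The code assumes that 'total existing firms' means the stock of firms incorporated before the year in question, and it ignores exits because only incorporation dates are used; both are assumptions, and the names and data are invented.

```python
from collections import Counter


def entry_rates(incorporation_years):
    """Return {year: entrants as a % of firms incorporated earlier}.

    ASSUMPTION: the stock of existing firms is the cumulative number of
    earlier incorporations; firm exits are not modelled.
    """
    entrants = Counter(incorporation_years)
    rates = {}
    stock = 0
    for year in sorted(entrants):
        if stock > 0:  # no rate for the first observed year
            rates[year] = 100.0 * entrants[year] / stock
        stock += entrants[year]
    return rates


# Invented incorporation years: 10 firms in 1990, 2 in 1991, 3 in 1992.
years = [1990] * 10 + [1991] * 2 + [1992] * 3
rates = entry_rates(years)  # {1991: 20.0, 1992: 25.0}
```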
