A Statistical Text Mining Method for Patent Analysis


Department of Statistics, Cheongju University

Abstract

Most text data from diverse document databases are unsuitable for analytical methods based on statistics and machine learning algorithms. Patent documents are also compiled into text datasets. As with other document datasets, we therefore need to transform patent documents into structured data before a statistical analysis. This transformation is performed using the preprocessing techniques of text mining, and the patent documents can be analyzed once they have been preprocessed. A patent analysis thus requires two phases, preprocessing and analysis. In this paper, we try to combine the two phases into one. We propose a statistical text mining method to improve the performance of a patent data analysis. Our proposed method carries out text mining and a statistical analysis at the same time. To show the contribution of our study, we illustrate how it can be applied in a real domain using a target technology.

Keywords: Statistical text mining, Preprocessing, Statistical analysis, Patent system, Technology forecasting, Patent analysis

1. Introduction

A patent is a type of intellectual property (IP). Unlike other forms of IP such as trademarks, copyrights, and trade secrets, a patent includes diverse information on technological development [1]. In addition, the patent system protects the exclusive rights of inventors to their developed technology for a limited period of time [2]. Many companies want to analyze patent documents to identify technology trends. However, because patent data include text, numbers, dates, and pictures, they are unsuitable for analysis methods such as statistics and machine learning algorithms [1], which require structured data consisting of numbers or frequencies. To solve this problem, we should preprocess patent documents using text mining techniques [3].
Text mining is a data mining process used to manage and analyze texts or document data [4]. We therefore use text mining methods to preprocess patent data and build structured data for a statistical analysis. In this paper, we combine statistics and text mining for a patent analysis. This type of combination is called statistical text mining (STM) and has been used in the medical and bio sectors [5][6]. This research provides another STM approach, aimed at patent analysis, that differs from previous studies. In our STM, we use descriptive statistics and multivariate statistical analysis. In addition, using text mining techniques, we import data, create a corpus, and eliminate whitespace and stop words. In other words, we apply basic and advanced analyses to structured patent data after preprocessing the patent documents. The STM results can be used for R&D planning, technology management, or new product development. To show how our STM can be used in a real domain, we perform a case study using patent documents retrieved from the Korea Intellectual Property Rights Information Service (KIPRIS) database. We then apply our STM process to this case study in a step-by-step manner.

2. Statistics and text mining

Statistics and text mining are both data mining tools [4]. Statistics is a learning method for prediction and forecasting [7]. Most statistical methods are based on a probability distribution such as a binomial or normal distribution. Text mining is an analytic process for discovering novel information from a large amount of text data [4]. Unlike general data mining, text mining requires natural language processing techniques, because text data must be preprocessed before they can be analyzed.

International Journal of Advancements in Computing Technology (IJACT), Volume 5, Number 12, August

[...] the basic and advanced analyses of STM are based on statistical methods. Figure 2 shows the step-by-step process of our proposed STM method.

Figure 2. Proposed STM process

We retrieve patent documents for a given technology, but the raw patent data are not suitable for statistical analysis. We therefore transform the documents into structured data in the form of a patent keyword matrix. The rows of the matrix are the retrieved patents, and the columns are the keywords extracted from those patents. Each cell of the matrix holds the occurrence frequency of a keyword in a patent, which is numerical data. We can therefore analyze this matrix with statistical methods. The results of a statistical analysis may reveal novel knowledge and hidden patterns, and, once the opinions of domain experts have been applied, can be used for the R&D planning and new product development of a company. In the next section, we describe a case study suggesting how the proposed STM can be applied to a real-world problem.

4. Experimental results

For our case study, we searched for patent documents in the KIPRIS patent database [11]. The keyword equation was title = text * mining; in other words, we retrieved patent documents related to text mining technology. We obtained patents from the United States and Europe on July 5, [...]. The patent data consisted of 67 documents in total: 60 were US patents, and the remaining 7 were European patents. Next, we illustrate the case study for STM in a step-by-step manner.

Structured data construction

In this paper, we applied text mining techniques to construct structured data for a statistical analysis. To preprocess the patent data used in our research, we used the text mining package of the R project [12][13]. We downloaded the patent data as an Excel file and imported it into R. We created a corpus of 67 text documents.
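The preprocessing steps named in this paper (creating a corpus, eliminating whitespace and stop words) can be illustrated with a minimal sketch. It is written in plain Python rather than with the R text mining package used in the study, and the stop-word list, the function name `preprocess`, and the sample abstracts are all assumptions made for illustration only.

```python
import re

# Tiny illustrative stop-word list (an assumption; the paper does not give its list).
STOP_WORDS = {"a", "an", "and", "are", "for", "from", "of", "the", "to", "with"}

def preprocess(document):
    """Collapse whitespace, lowercase the text, and drop stop words."""
    collapsed = " ".join(document.split())              # eliminate extra whitespace
    tokens = re.findall(r"[a-z]+", collapsed.lower())   # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]

# Hypothetical patent abstracts standing in for the retrieved documents.
raw_documents = [
    "  A method   and apparatus for\ttext  mining ",
    "The system extracts information from the document",
]
corpus = [preprocess(d) for d in raw_documents]
print(corpus)
```

A real stop-word list would be much longer, and the final choice of which common words to drop remains a judgment call for domain experts.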
From this corpus, we constructed a patent term matrix with 67 documents and 1439 terms. We found the following highly frequent terms (frequency over 10) in the structured data: also, analysis, analyzed, and, apparatus, are, associated, based, between, business, can, candidate, characteristic, coincident, computer, condition, confidence, content, corresponding, data, definition, degree, device, disclosed, document, each, element, entities, extracted, extracting, extraction, extract, feature, for, frequency, from, generating, generation, has, include, including, information, inherent, input, into, knowledge, language, least, list, may, means, method, mining, more, obtained, one, phrase, piece, plurality, portion, predetermined, present, processing, program, provided, query, referring, result, said, search, section, set, structured, such, system, term, text, textual, that, the, this, topic, unit, used, user, using, value, video, web, which, with, within, word, and words.
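The construction of a document-by-term frequency matrix and the selection of frequent terms can be sketched as follows. This is a plain Python illustration, not the R code used in the study. The helper names, the three sample documents, and the reading of "over 10" as a total corpus frequency above a threshold are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_term_matrix(docs):
    """Return (terms, matrix): rows are documents, columns are term counts."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix

def frequent_terms(terms, matrix, min_total):
    """Terms whose total frequency over all documents exceeds min_total."""
    totals = [sum(column) for column in zip(*matrix)]
    return [t for t, n in zip(terms, totals) if n > min_total]

# Hypothetical documents; the study used 67 patents and a threshold of 10.
docs = [
    "Text mining method for text data",
    "A data mining system",
    "Method and apparatus for text search",
]
terms, matrix = build_term_matrix(docs)
print(len(docs), "documents x", len(terms), "terms")
print(frequent_terms(terms, matrix, 1))
```

With the real corpus, calling `frequent_terms(terms, matrix, 10)` would correspond to the frequency cut-off described above, under the stated assumption about how that cut-off is defined.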

To build the patent keyword matrix, we selected keywords from the frequent terms by eliminating stop words and common words. In this way, we determined the following candidate keywords: analysis, associated, business, candidate, characteristic, coincident, computer, condition, confidence, content, data, definition, degree, device, disclosed, document, element, entities, extract, feature, frequency, generation, including, information, inherent, input, knowledge, language, list, means, method, mining, phrase, piece, plurality, portion, predetermined, present, processing, program, query, referring, search, section, set, structured, system, term, text, topic, unit, used, user, value, video, web, and word. The final keywords can be determined based on the knowledge of experts in the text mining field. In this paper, we selected the top ten keywords from the candidates, excluding text and mining.

Table 2. Top ten keywords for text mining patents
Keywords: business, confidence, content, disclosed, extract, feature, information, query, search, structured

We therefore constructed a patent keyword matrix of 67 patents and 10 keywords. Using this matrix, we performed basic and advanced analyses, the results of which were used for a practical application.

Basic analysis

STM provides a summary and visualization of the retrieved patent data in the basic analysis step. The following figure shows the number of patents by year.

Figure 3. Number of applied patents

The first patent related to text mining was applied for in [...]. We can therefore see that the technological development of text mining started comparatively recently. The development of text mining technology has also decreased recently. In addition, we summarized the International Patent Classification (IPC) codes of the retrieved patent data. IPC is a patent classification system from the World Intellectual Property Organization [14].
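The tallies behind this basic analysis step (patents per year and patents per IPC code) can be sketched in a few lines of Python. The records below are invented for illustration and are not the 67 patents of the case study. The truncation to four characters assumes that codes are summarized at the IPC subclass level, such as G06F.

```python
from collections import Counter

# Hypothetical records: (application year, IPC codes). All values are made up.
patents = [
    (2005, ["G06F 17/30"]),
    (2007, ["G06F 17/27", "G06N 5/02"]),
    (2007, ["G06Q 10/00"]),
    (2010, ["G06F 17/30", "C12Q 1/68"]),
]

# Number of patents applied for in each year.
patents_per_year = Counter(year for year, _ in patents)

# Number of occurrences of each IPC subclass (first four characters of a code).
ipc_counts = Counter(code[:4] for _, codes in patents for code in codes)

for year in sorted(patents_per_year):
    print(year, patents_per_year[year])
print(ipc_counts.most_common())
```

The earliest key of `patents_per_year` gives the year of the first application, and `ipc_counts.most_common()` orders the subclasses from most to least frequent.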
The following table shows the IPC codes used in text mining patents and their representative technology.

Table 3. Number of IPC codes
IPC: C12Q, G06F, G06K, G06N, G06Q, G06T, G10L
Number of IPCs: [...]

We found that the text mining patents covered seven IPC codes. Most of these patents fall under IPC code G06F, which covers electric digital data processing [14]. IPC code C12Q covers measuring or testing processes involving enzymes or micro-organisms, and the remaining IPC codes relate to information technologies. We can therefore see that text mining technology is an interdisciplinary area of technology development. Next, we attempted to find more detailed knowledge through an advanced analysis.

Advanced analysis

In this section, we applied a statistical analysis to the structured data. First, we performed a correlation analysis between the keywords.

Figure 4. Correlation coefficient matrix between keywords

The keywords confidence, feature, and structured have relatively large correlation coefficients with text. In addition, mining has a relatively large correlation coefficient with confidence. We therefore concluded that the development of text mining technology is needed for confidence technology, and that feature-based and structured approaches should be considered in text mining technology development. We performed another STM analysis as follows.

Table 4. Influence of keywords on text mining
Responses: Text, Mining (each with estimate and significance)
Keywords: business, confidence, content, disclosed, extract, feature, information, query, search, structured
Values: [...]

We carried out a multiple regression analysis, in which the estimate is the regression parameter and the significance is the probability value (p-value). The estimate gives the relative influence of a keyword on text or mining, and the significance supports statistical testing (an effect is treated as significant when its p-value is less than 0.05). Similar to the results of the correlation analysis, we found that confidence technology is important for the development of text mining technology. We next applied our results to a practical problem.
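The correlation analysis between keyword columns described above can be sketched as follows. This is a plain Python illustration that assumes the Pearson coefficient (the paper does not name the coefficient it used), and the keyword frequencies are invented for five hypothetical patents. The multiple regression step is not covered here, since p-values would require a statistics library.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either column is constant.
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Pairwise correlations between named columns of a keyword matrix."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Invented keyword frequencies; each list holds one value per patent.
keyword_columns = {
    "text": [3, 1, 4, 2, 5],
    "mining": [2, 1, 3, 1, 4],
    "confidence": [1, 0, 2, 0, 3],
    "search": [0, 2, 1, 3, 0],
}
corr = correlation_matrix(keyword_columns)
for name, row in corr.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Reading one row of `corr`, such as `corr["text"]`, corresponds to inspecting which keywords have relatively large coefficients with text.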
