
Data Mining is becoming Extremely Powerful, but Dangerous

N. Kulathuramaiyer, H. Maurer

Abstract

Data Mining describes a technology that discovers non-trivial hidden patterns in a large collection of data. Although this technology has a tremendous impact on our lives, the invaluable contribution of this invisible technology often goes unnoticed.

This paper addresses the various forms of data mining, shedding light on its expanding role in enriching our lives. Emerging forms of data mining are able to perform multidimensional mining on a wide variety of heterogeneous data sources to provide solutions to many problems.

This paper highlights the advantages and disadvantages that arise from the ever-expanding scope of data mining. Data Mining augments human intelligence by equipping us with a wealth of knowledge, empowering us to perform our daily tasks more effectively and efficiently. As the mining scope and capacity increase, users and organisations are now more willing to compromise privacy as a trade-off for gaining peace of mind and additional comforts. The huge data stores of the master miners allow them to gain deep insights into individual lifestyles, social and behavioural patterns, and business and financial trends, resulting in a disproportionate distribution of power. Is it then possible to constrain the scope of mining while delivering the promise of a better life?

Introduction

As we become overwhelmed by the influx of data, Data Mining presents a refreshing window to deal with the onslaught. Data Mining thus holds the key to many unresolved mysteries and age-old problems, whereby the availability of data and the power to analyse it present new possibilities. This paper explores this important technology, shedding light on its tremendous power and potential.

According to [Han and Kamber, 2007], data mining is defined as the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases. We take a broad understanding of data mining, in which we also include other related machine-based discoveries such as deductive query processing and visual data mining. Databases include structured data (in relational databases), semi-structured data (e.g. metadata in XML documents) and unstructured documents such as text documents and multimedia content. Visual Data Mining refers to the discovery of patterns in large data sets by using visualization techniques.

As an example, data mining has been widely employed for learning consumer behaviour based on historical data of purchases made at retail outlets. Demographic data collected from loyalty cards is combined with behavioural patterns of buyers to enable retailers to design promotional programmes for specific customer segments. Similarly, credit card companies use data mining to discover deviations in the spending patterns of customers in order to counter fraud. Through this, these companies are able to guarantee the highest quality of service to their customers. Another form of mining has been employed in tracing possible terrorist attacks through the mining of traffic patterns of chatter. Chatter is an electronic signal that is detected on phone lines. [] highlights that a surge in chatter followed by a sudden silence was recorded just before the September 11 incident as well as before the Bali bombing and other similar incidents.

Despite the success stories in areas such as customer relationship modelling, fraud detection and banking [KDD], the majority of applications tend to employ generic approaches and lack due integration with workflow systems. As such, Data Mining is currently at a chasm state and has yet to become widely adopted by the large majority [Han and Kamber, 2007].

The subsequent section gives a broad overview of data mining technology, to provide the basis for the ensuing discussions on its impact.

Data Mining Technology

Mining involves the extraction of patterns from a collection of data via the use of machine learning algorithms. Sophisticated mining technologies of today integrate multiple machine learning algorithms to perform one or more of the following functions:

a) construct an aggregated or personalised predictive model of the systems, events and individuals being studied, and support decision making by employing these models in a number of ways (extraction of classification patterns)

b) identify similarity/dissimilarity in terms of distributional patterns of data items and their relationships with associated entities (clustering)

Having access to data thus becomes a powerful capability which can then be effectively harnessed by sophisticated mining software. The statement by O’Reilly [O’Reilly] that ‘Data is the Next Intel Inside’ illustrates its hidden potency. Data in the hands of credit card companies will allow them to profile customers according to lifestyle, spending patterns and brand loyalty. Political parties are now able to predict with reasonable accuracy how voters are likely to vote [Rash, 2006].
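The two functions listed above, classification and clustering, can be illustrated with a minimal sketch. The customer records, the nearest-neighbour classifier and the k-means routine below are assumptions chosen for brevity; the paper does not prescribe particular algorithms.

```python
from math import dist

# Hypothetical customer records: (monthly_spend, visits_per_month) with a label.
labelled = [
    ((20.0, 1.0), "occasional"), ((25.0, 2.0), "occasional"),
    ((200.0, 9.0), "loyal"), ((220.0, 11.0), "loyal"),
]

def classify(point, examples):
    """Function (a): predict a label by copying the nearest labelled example."""
    return min(examples, key=lambda ex: dist(point, ex[0]))[1]

def kmeans(points, centroids, rounds=10):
    """Function (b): group unlabelled points around the nearest centroid."""
    groups = [[] for _ in centroids]
    for _ in range(rounds):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (unchanged if the group is empty).
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids, groups
```

Calling `classify((210.0, 10.0), labelled)` assigns a new customer to a known segment, while `kmeans` discovers segments without any labels.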

Data Mining Process

In order to describe the processes involved in performing data mining we divide it into 3 phases: domain focussing, model construction (actual mining using machine learning algorithms), and decision making (applying the model to unseen instances). Jenssen, 2002, refers to these phases as Data Gathering, Data Mining and Decision Making.

Domain focussing

A traditional data mining architecture [Usama, Fayyad 1996] divides the first phase into smaller steps such as pre-processing, selection, cleaning, and transforming the dataset into focussed relations. A well-scoped mining in a well-defined domain area can be characterised by this traditional model.

However, in a more complex data mining application [as in Mobasher, 2005] this phase (referred to as Data preparation) may incorporate the use of domain knowledge and site structure in the discovery of patterns in unstructured data. In this case, the preliminary phase involves activities such as data cleaning, validation of page views and detection of session boundaries. In mining unstructured data such as Web logs, there is a need to effectively identify basic units of user events, such as pageviews, which may be vague. These pageviews then need to be grouped together to form sessions, which may also have grey boundaries.
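The grouping of pageviews into sessions can be sketched as follows. This is a minimal illustration that assumes a fixed inactivity timeout marks a session boundary; the 30-minute value and the `(user, timestamp, url)` record layout are assumptions, not details taken from [Mobasher, 2005].

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed cut-off, not from the paper

def sessionize(pageviews, timeout=SESSION_TIMEOUT):
    """Group (user, timestamp, url) pageviews into per-user sessions.

    A new session starts whenever the gap since the same user's previous
    pageview exceeds `timeout`.
    """
    by_user = defaultdict(list)
    for user, ts, url in sorted(pageviews, key=lambda pv: (pv[0], pv[1])):
        by_user[user].append((ts, url))

    sessions = []
    for user, views in by_user.items():
        current = [views[0]]
        for prev, view in zip(views, views[1:]):
            if view[0] - prev[0] > timeout:
                sessions.append((user, current))
                current = []
            current.append(view)
        sessions.append((user, current))
    return sessions

# Hypothetical log entries for two users.
log = [
    ("u1", 0, "/home"), ("u1", 120, "/search"), ("u1", 4000, "/home"),
    ("u2", 60, "/cart"),
]
```

A hard timeout is only one way to draw the "grey" boundary; a real system might also use referrer information or site structure.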

We describe this phase as domain focussing because, in mining applications such as Web search and domestic security, this phase itself involves the application of some form of clustering or incorporates intensive knowledge engineering. For search log mining, a model characterising user search behaviour is aggregated. User behaviour patterns [as described in Colle, Srivastava] can be employed to structure search logs into intention-related transactions. For anti-terrorism or domestic security [Anderson], this phase involves associational subject link analysis requiring a deep domain analysis (a great deal of manual effort is needed).

Model Construction Phase

The subsequent phase involves the development of a predictive and/or descriptive model based on the application of machine learning algorithms. At this model construction phase a model of generalised patterns is constructed to capture the intrinsic patterns stored in the data. For instance, we could have a model of the spending patterns of loyalty cardholders, or a descriptive model of SPAM message characterisation.

This (mining) phase could [Mobasher, 2005] involve the derivation of aggregated usage profiles based on a multidimensional mining of usage patterns organised according to clustered characterisations. Their mining phase employs multiple machine learning schemes to perform transaction clustering, pageview clustering, associational pattern mining and sequential pattern mining in extracting aggregate usage profiles. In the mining described by [Mobasher, 2005], a number of sessions over a period of time are combined to characterise a user profile.

Clickstream data has been used to model global profiles of buyers, indicating details such as the intensity and urgency of the buyer in acquiring a product [Hofgesang & Kowalczyk, 2005]. Amazon makes use of clickstream data in this manner to profile users based on transactions of book purchases. Their session identification is simpler in that users maintain accounts and all purchase transactions are bounded by secure sessions. Apart from that, Amazon is also able to employ other meta-data captured in user accounts and other contributions of users (editing, reviews) to characterise profiles.

We describe this phase as model construction to also highlight the data integration from multiple data sources that is performed in emerging applications. It has to be noted that database matching or integration is performed across all three phases.

As highlighted in [Mobasher], e-commerce applications integrate user data such as demographics, ratings and purchase histories with product attributes from operational databases to enable the discovery of important business intelligence metrics.

Decision Making Phase

The third phase involves the application of the generated model to perform decision making. This is an important phase where profiling and user modelling are applied to real-life situations. Simplistic applications of data mining tend to merely employ the model to predict the likelihood of events or occurrences, based largely on past patterns. Amazon, for example, is able to recommend books according to a user’s profile. Similarly, network operators are able to track fraudulent activities in the usage of phone lines by tracking deviation patterns as compared to a standard usage characterisation. User profiling in complex applications can be used as a basis for conviction and to make further discoveries.
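Tracking deviations from a standard usage characterisation can be illustrated with a simple statistical sketch. The z-score rule, the threshold of 3 and the call-minute figures are assumptions for illustration; operators' actual fraud models are not described in the paper.

```python
from statistics import mean, stdev

def flag_deviations(history, recent, threshold=3.0):
    """Return the recent values that lie more than `threshold` standard
    deviations away from the mean of the historical usage profile."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: anything different counts as a deviation.
        return [x for x in recent if x != mu]
    return [x for x in recent if abs(x - mu) / sigma > threshold]

# Hypothetical daily call minutes for one subscriber.
daily_call_minutes = [30, 35, 28, 32, 31, 29, 34]
new_days = [33, 240, 27]
suspicious = flag_deviations(daily_call_minutes, new_days)
```

Here the model is simply the mean and spread of past usage; the decision-making step applies it to unseen days and reports those that do not fit.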

The next section focuses on Web search as a complex form of knowledge discovery where some form of mining is performed in almost every stage within the 3 phases of mining discussed. Google, for example, employs spell checking and automatic suggestions (Google Suggest) at the data cleaning stage (incorporating clustering). Performing mining at this stage allows the filtering of queries and the caching of results to reduce the load on the ‘full-search’ miner.

Web Search as Data Mining

Web Mining can typically be divided into Web page content mining, Web structure mining and Web log mining (including search logs). Traditional search engines utilised web content only for building their index of the Web. Web structure has become important in current search engines, which use web structure patterns to determine the popularity of websites. Web log mining has already been addressed adequately in the previous section. Leading search engines of today combine these three forms of mining to provide results that are able to meet users’ needs better.
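As an illustration of how link structure can yield a popularity score, the sketch below runs a PageRank-style power iteration over a toy link graph. The paper does not name a specific algorithm, so this choice, the damping factor of 0.85 and the four-page graph are all assumptions.

```python
def link_popularity(links, damping=0.85, iterations=50):
    """PageRank-style scores for a {page: [pages it links to]} graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical web of four pages; three of them link to "c".
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_popularity(web)
```

Pages that attract many links, or links from well-linked pages, end up with higher scores, which is the sense in which structure patterns indicate popularity.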

The web has emerged as a massive repository of information with billions of web pages, massive collections of multimedia documents, millions of digitised books, decades of financial documents, world news in almost all languages, massive collections of community-tagged multimedia objects, and the list goes on. Search engines have turned this repository into a massive data warehouse as well as a playground for the automated discovery of hidden treasures. Web Search is thus viewed as an extensive form of multidimensional heterogeneous mining of largely unstructured data, uncovering an unlimited number of mind-boggling discoveries. The scale of data available is in the range of petabytes [Witten], much greater than the terabytes of data available in the hands of large global corporations such as Walmart.

As compared to the Data Mining process described in section [], Web search is a much more complex process. Figure 1 illustrates the scope and extent of mining performed by search engines.

of data resources is effectively exploited in their ability to support users’ decision-making process, as well as in providing alternative channels for further investigation. Search engines can either simultaneously or incrementally mine these datasets to

over a period of time, search engines have access to a great deal of insight into the lives of presumably ‘anonymous’ searchers. A search query indicates the intent of a user to acquire particular information to accomplish a task that relates to some aspect of his or her lifestyle. This ability to capture intent opens up a great many possibilities for search engines. The sensitive nature of this data is described in section []. The global patterns in search query logs also provide insights into the usefulness of particular keywords for an inquiry.

Search traffic patterns are another data source that can be applied to highlight relationships between search terms and events. For instance, the number of searches for “Christmas presents” peaks in the early part of the month of December [Heather Hopkins, 2007]. Search traffic data analysis has also been shown to reveal social and market patterns such as unemployment and property market trends (see Trancer, 2007). Apart from that, the intentions of global users can be modelled by the terms employed in search. Sudden bursts in search term frequency have been observed from users seeking quick answers to questions posed in reality shows such as “Who wants to be a Millionaire” [Witten]. An emerging paradigm, mashups (see Kulathuramaiyer, Maurer, 2007), together with mobile web services, allows the discovery of localised contextual profiles.

Targeted advertisements based on keyword bidding are currently employed by search engines. In the near future, complex mining capabilities will provide personalised, context-specific [Lenssen] advertisements. It would be possible via RFID technology for a user passing by an intelligent billboard [Google smart billboards] to encounter a highly personalized message such as ‘Nara, you have not purchased your airline ticket yet, you have only 2 weeks for your intended flight. I know of a discount you can’t refuse.’ This level of user profiling could be achieved merely by utilizing shopping cart analysis, together with cookies and calendar entries. Figure 2 illustrates the layered mining that could be employed to facilitate such a discovery. This is described by [Kulathuramaiyer and Balke] as connecting the dots, to illustrate the ability to extract and harness knowledge from massive databases at an unprecedented level.

Figure 2: Connected Mining based on Database Matching

The next section describes emerging forms of complex data mining, which would require combining many of the above mining functions and more.

Applications of Data Mining

Environmental Modelling Applications

There are complex problems for which data mining could be used to provide answers by uncovering patterns hidden beneath layers of data. In many cases, domain focussing has in the past been the biggest challenge. The layered mining of heterogeneous data as described in the previous section presents new possibilities towards the unearthing of deep-rooted mysteries. As an example, data mining could be employed for the modelling of environmental conditions in the development of an early warning system to address a wide range of natural disasters such as avalanches, landslides and tsunamis, and other environmental events such as global warming. The main challenge in addressing such a problem is the lack of understanding of the structural patterns characterising the various parameters, which may currently not be known.

As highlighted by [Maurer, et al], although a large variety of computer-based methods have been used for the prediction of natural disasters, the ideal instrument for forecasting has not been found yet. As highlighted in their paper, there are also situations whereby novel techniques have been employed but only in a narrow domain of limited circumstances.

Integration of multiple databases and the compilation of new sources of data are required in the development of full-scale environmental systems. As advances in technology allow the construction of massive databases through the availability of new sources of data such as multimedia data and other forms of sensory data, data mining could well provide a solution. In order to shed light on a complex problem such as this, massive databases that were not previously available need to be incorporated, e.g. data about after-event situations of the past [Maurer, et al]. Such data on past events could be useful in highlighting patterns related to potentially in-danger sites. Data to be employed in this mining will thus comprise both weather and terrestrial parameters, together with other human-induced parameters such as vegetation and deforestation over a period of time [Maurer, et al].

Domain focussing will be concerned with the discovery of causal relationships (e.g. using Bayes networks) as a modelling step. Multiple sources of data, which include new sources of data, need to be applied in the discovery of likely causal relationship patterns. A complex form of data mining is required even at the phase of domain focussing. This will involve an iterative process whereby hypothesis generation could be employed to narrow the scope of the problem to allow for a constrained but meaningful data collection. For complex domains such as this, unconstrained data collection may not always be the best solution. Domain focussing would thus perform problem detection, find deterministic factors and hypothesise relationships that will be applied in the model. [Beulens et al, 2006] describe a similarly complex representation system for an early warning system for food supply networks.

Subsequently, the model construction phase will employ a variety of learning algorithms to profile the events or entities being modelled. As this stage may negate model relationships, domain focussing may need to be repeated and iteratively performed to refine the model further. The model construction phase should allow the incremental development of a model, based on a complex representation of the causal networks [Beulens et al, 2006].

The model construction phase will explore the use of mining methods such as clustering, associational rule mining and neural networks to verify the validity of causal associations. Once a potential causal link is hypothesised, verification can be done by employing various data mining methods. [Beulens, et al] have proposed a combination of approaches which include deviation detection, classification, dependence model and causal model generation.

The Decision Making phase will then employ the validated causal relationship model in exploring real-life case studies. Data mining could serve as a means of characterising profiles for both areas prone to disasters and those which are safe. Data visualisation will need to be employed in such a scenario to contrast the two clusters. An environment for interactive, explorative visual domain focussing is crucial to highlight directions for further research.

Until domain focussing is effectively achieved, a semi-automated solution [Pillmann, 2002] may be the best option. Alternatively, software agents could be employed to perform autonomous discovery for tasks such as validating causal links.

Medical Applications

We will briefly discuss another form of mining that has a high impact. In the medical domain, data mining can be applied to discover unknown causes of diseases such as ‘sudden death’ syndrome or heart attacks, which remain unresolved. The main difficulty in performing such discoveries is in collecting the data necessary to make rational judgements. Large databases need to be developed to provide the modelling capabilities. These databases will comprise clinical data on patients found to have the disease and on those who are free of it. Additionally, non-traditional data, such as retail sales to determine purchases of drugs and calls to emergency rooms, together with auxiliary data such as microarray data in genomic databases and environmental data, would also be required [Li].

Non-traditional data could also incorporate major emotional states of patients by analyzing and clustering the magnetic field of human brains, which can be measured non-invasively using electrodes attached to a person’s head [Maurer, et al]. Social patterns can also be determined through the profile mining described in the previous section to augment the findings of this system. Findings on the functional behaviour of humans via genomic database mining would also serve as a meaningful input.

The development of large databases for medical explorations will also open possibilities for other discoveries, such as mining family medical history and survival analysis to predict life span [Han and Kamber].

Advantages of Data Mining

Data mining has crept into our lives in a variety of forms. It has empowered individuals across the world to vastly improve their capacity for decision making in focussed areas. Powerful mining tools are going to become available to a large number of people in the near future. This section describes the advantages of data mining.

Data mining will enhance our lives in a number of ways, which include enabling domestic security through a number of surveillance systems, better health through medical mining applications, protection against many forms of intriguing dangers, and access to just-in-time technology to address most of our needs. Mining will provide companies with effective means of managing and utilising resources. People and organizations will acquire the ability to perform well-informed (and possibly well-researched) decision making. Data mining also provides enlightening answers by sifting through multiple sources of information which were never known to exist or could not conceivably be acquired.

Data mining could be combined with collaborative tools to further facilitate and enhance decision making in a variety of ways. Data mining is thus able to transform personal or organizational knowledge, which may be locked in the heads of individuals (tacit knowledge) or in legacy databases, so that it becomes publicly available. Many more new benefits will emerge as technology advances.

Disadvantages of Data Mining

Having seen the powers of this fascinating technology and its profound impact and influence on our lifestyles, we will now explore the potential dangers of this technology. As with all forms of technology, there is a need to explore both sides of the coin.

In order to illustrate the privacy concerns of data mining, we describe the sensitive nature of web search history. Search history data represents an extremely personal flow of the thought patterns of users that reflects one’s quest for knowledge, curiosity, desires and aspirations, as well as social inclinations and tendencies. As such it is not surprising that a large amount of psychographic data, such as a user’s attitudes towards topics, interests, lifestyle, intent and beliefs, can be detected from these logs. The extent of the possible discoveries was clearly illustrated by the incident in which AOL released the personal searches of 658,000 subscribers [Jones, 2006]. This incident has exposed the sensitivity of the information in the hands of search engines.

A great deal of knowledge about users is also being maintained by governments, airlines, medical miners and shopping consortiums. A valid concern is that the slightest leak could be disastrous. Figure 3 illustrates the amount of knowledge about anonymous users that could be established by global search engines via the connection of dots (see Kulathuramaiyer and Balke, 2006).

Figure 3: Search History can reveal a great deal of information about users

Other forms of mining that may be capable of even more dramatic privacy infringements include the Real-time Outbreak and Disease Surveillance program as an early warning for bioterrorism [Spice], the Total Information Awareness program [Anderson] and the Automated Targeting System [ATS].

Particularly in these types of applications, another common danger is profiling, where mining results may have drastic implications such as an arrest. There is a danger of generalisations being based on factors such as race, ethnicity or gender rather than on deeper, more meaningful indicators. Another danger is the prevalence of false positives, where an entirely innocent individual or group is targeted for investigation because of poor decision making. To illustrate the danger of false positives, a reasonable rate of success of 80% was considered for an application such as TIA [b]. This would result in 20% of US citizens (48 million) being considered false positives. [b]
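The arithmetic behind the false-positive figure can be made explicit with a short sketch. The base of 240 million people is an assumption inferred from the figures quoted above (48 million is 20% of 240 million); it is not stated in the source, and the function name is hypothetical.

```python
def expected_false_positives(population, accuracy):
    """Number of people wrongly flagged if everyone is screened and a
    fraction (1 - accuracy) of the decisions are wrong."""
    return round(population * (1.0 - accuracy))

# Assumed screening base, inferred from the paper's own numbers.
SCREENED = 240_000_000
wrongly_flagged = expected_false_positives(SCREENED, 0.80)
```

Even a seemingly respectable success rate therefore translates into tens of millions of wrongly flagged people once it is applied to an entire population.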

Data mining will empower mining giants to go beyond the ability to PREDICT what is going to happen in a number of areas of economic importance, and actually have the power to KNOW what will happen, and hence to exploit, for example, the stock market in an unprecedented way. They also have the capacity to make judgements on issues and persons with scary accuracy.

Data mining has thus put in the hands of a few large companies the power to affect the lives of millions through the control they have over the universe of information. The unconstrained expansion of their business scope endows them with the omniscience to affect our lives.

The next section discusses solutions to constrain the scope and visibility of mining without compromising on the extent of discovery.

What can we do?

In order to avoid the dangers of connecting the dots, two approaches have

In this distributed approach, separate facilities will be adopted for the development of software for document similarity detection (a similar capability is found in search engines). Each distributed site is responsible for performing deep but focussed mining of a single domain of specialisation (e.g. Computer Science, Psychology). Facilities such as this can be established in numerous localities throughout Europe and even across the world to effectively address multiple disciplines and languages. This will also address the current problem with search engines, which tend to be too generic [S.J. Vaughan-Nichols].

This proposal ensures that no central agency will have exclusive control over powerful technology and all resources. In order to ensure the neutrality of content, all such facilities will need to be managed by not-for-profit agencies such as universities and public libraries.

Anonymous Mining

[Kovatcheva] has proposed a means of protecting the anonymity of surfers through the use of anonymity agents and pseudonym agents, as they prevent the need for users to be identified. Their paper also proposed the use of negotiation agents and trust agents to assist users in reviewing a request from a service, so that they can make a rational decision about allowing the use of particular personal data.

A similar agent-based approach is highlighted by [K. A. Taipale] via rule-based processing.

First, an "intelligent agent" is used for dispatching a query to distributed databases. The agent will then negotiate access and permitted uses for each database. Secondly, data items themselves are labeled with meta-data describing how that item must be processed. Thus, even if a data item is removed or copied to a central database, it retains the relevant rules by which it must be processed.
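The idea of data items that carry their own processing rules can be sketched as follows. The class and function names (`LabelledItem`, `process`, `query_agent`) and the purpose strings are hypothetical and chosen only for illustration; the cited work does not specify an implementation.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LabelledItem:
    """A data item labelled with meta-data stating its permitted uses."""
    value: str
    permitted_uses: frozenset = field(default_factory=frozenset)

def process(item, purpose):
    """Release the value only if the item's own label permits this purpose."""
    if purpose not in item.permitted_uses:
        raise PermissionError(f"use '{purpose}' is not permitted for this item")
    return item.value

def query_agent(databases, purpose):
    """Hypothetical agent: collect from each database only the items whose
    labels allow the stated purpose."""
    results = []
    for db in databases:
        results.extend(item for item in db if purpose in item.permitted_uses)
    return results

# Two hypothetical distributed databases.
airline_db = [LabelledItem("itinerary", frozenset({"fraud_check"}))]
retail_db = [LabelledItem("purchases", frozenset({"fraud_check", "marketing"}))]
```

Because the rules travel with each item, a copy held in a central store is still checked by `process` against the same label as the original.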

Value Sensitive Design has been proposed by [Friedman], which employs logical modelling to account for human values in a principled and comprehensive manner throughout the design process. Another anonymisation step has also been proposed through a framework by [e].

The main challenge lies in coming up with guidelines and rules that site administrators or software agents can use to direct various analyses on data without compromising the identity of an individual user. Furthermore, there should be strict regulations to prevent usage data from being exchanged inappropriately or sold to other sites. Users should also be made aware of the privacy policies of any given site, so that they can make an informed decision about revealing their personal data. The success of such guidelines can only be guaranteed if they are backed up by a legal framework.

Conclusion

As data mining matures and becomes widely deployed in even more encompassing ways, we need to become aware of how to use it to effectively enrich our lives. At the same time, the dangers associated with this technology need to be minimised by deliberate efforts on the part of enforcement agencies, miners and the users of the system.

The power to enhance our lives with the promise of unlimited knowledge will make the world much more exciting by opening up numerous possibilities. As the degree of user profiling of BSEs can be mind-boggling, drastic actions are required fast. Effective measures are required to curtail the dissemination of private information. Apart from that, international laws need to be in place to ensure a balanced growth and control of resources.

References

Battelle, J., 2005, The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture, Portfolio, Penguin Group, New York

David Jenssen, "Data mining in networks." Invited talk to the Roundtable on Social and Behavior Sciences and Terrorism. National Research Council, Division of Behavioral and Social Sciences and Education, Committee on Law and Justice. Washington, DC. December 11, 2002