Primary Menu

Category Archives: Excel

My Excel/VBA package for simple text mining for working with a known lexicon is now available: License it for free.

From the University of Michigan Office of Technology Transfer Site:

Title: Simple Text Mining

Technology # 4730

Background

Currently, there is a lack of text/network mining software available to the typical analyst end-user. Generally available text mining algorithms require extensive programming to implement. Typically, these more complex algorithms have an extremely steep learning curve, requiring a long-term commitment of professional software developer resources. Such solutions usually cannot be implemented by the typical analyst or small business.

Technology Description

The University of Michigan has developed an Excel-based tool and algorithm for text mining that ‘reads’ blocks of unstructured text for each word in a lexicon (supplied by the user) and assembles the words found into a common network analysis data structure called an “edge list.” This analysis includes additional descriptive data concerning the weight of lexicon words found. This ‘weight’ output allows for analysis of terms found. The network output allows for analysis of term “adjacency,” i.e. appearing together in the same block of unstructured text, the computation of network analysis measures, and the production of network visualizations. Outputs include user-specified data dimensions, carried over from the text input, for easily cross-referenced and more descriptive output.

Applications
• Analysis of unstructured text for a large number of known lexical terms
• Analysis of occurrence and adjacency (co-occurrence) of terms in papers, abstracts, etc.

It’s easier to communicate when your data is the most prominent feature of your chart. Start from good templates.

Basic Excel charts draw focus to themselves instead of the data at hand, by defaulting to include dark gridlines, dark lines and tick marks on each axis, a dark border, color-coded series, and indirect labeling. However, visualization master Edward Tufte and others have taught us that less is often more. By avoiding ‘non-data ink,’ chartjunk, and formatting ‘gloss,’ we can improve the visual clarity of — and therefore the effectiveness of — our data visualizations.

Time is valuable. This means that we should use tools that are good by default. To that end, I have created a series of templates for the six basic Excel chart types.

The basic formatting choices that distinguish these charts from Excel defaults are: light gray gridlines, no axis lines, no tick marks, no borders, and no legend [if you need to describe multiple series, consider the technique of small multiples]. If color encoding becomes necessary, you’ll have to do this manually (as was done for the pie chart at the bottom).

Then, the next time you want to insert a chart, select ‘All Chart Types’ from the bottom of any ‘Insert’ –> ‘Chart’ menu and then ‘Templates.’ You should see any templates saved into your templates directory as options.

Unlock the power of network structures in your data. Learn how to build and analyze networks to gain insights through relationship analysis. Apply approachable techniques and free, user-friendly software. Transform the data you have into the data you need – from relational databases and unstructured text to common network structures.

Jeff detailed his work in the Medical School Grant Review & Analysis Office. Examples will include: Identifying networks of collaborators from eResearch Proposal Management [eRPM PAF] data, discovering networks of concepts in unstructured text, and use cases from other administrative data sets. Jeff’s presentation included:

-“Networks 101″ – The basic building blocks of networks
-How people in any campus unit can apply network analysis
-An emphasis on approachable techniques and free, user-friendly software
-Strategies for effectively visualizing and sharing network-driven insights
-Tools, tips, and tricks

Presented examples were produced with NodeXL, which is a product of Microsoft Research and university collaborators. NodeXL is very approachable — you can produce a network visualization by simply copying and pasting in two columns of properly-formatted data (i.e. an ‘edge list’). It computes a desirable set of network metrics, including eigenvector centrality. Layout, visual encoding, and aesthetics are top notch. It installs into Excel 2007+ as a template.