Splunk, a company best known for systems that analyze machine-generated data, has created a tool for analyzing comments filed at Regulations.gov.

Interested in what this tool might show, Computerworld asked Splunk to use it to analyze comments on a proposed new rule regarding the ability of some H-1B spouses to work in the U.S.

The U.S. has been collecting comments on a regulation change that will allow spouses of H-1B holders who are seeking green cards, or permanent residency, to get work authorization. Today, these spouses cannot work in the country.

President Barack Obama's administration proposed the rule change, and its adoption is all but assured. But the U.S. is still following procedures and will be collecting comments about the change through July 11.

A cursory read of the comments shows that people who favor the rule change appear to outnumber, by a wide margin, people who oppose it. Still, reading through 7,000-plus comments to gauge sentiment is a serious time investment. Could this process be automated?

Splunk's just-released eRegulations Insights tool uses the government's open data API program to collect the comments and analyze unstructured data, meaning data that sits outside a database.

An analysis of 6,650 comments, the size of the comment pool at the time the tool was run, identified 615 unique comments, or 9% of the total.

Of the 6,035 non-unique comments, 453 were exact duplicates of 10 different comments.
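
As an illustration only, the short Python sketch below shows one way exact duplicates could be counted, along with the arithmetic behind the reported figures. It is not Splunk's method: the function names and the whitespace and case normalization are assumptions made for this example, and the sample comments are invented.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting
    differences do not hide an exact duplicate (an assumption)."""
    return " ".join(text.lower().split())


def exact_duplicate_groups(comments):
    """Map each normalized text that appears more than once to its count."""
    counts = Counter(normalize(c) for c in comments)
    return {text: n for text, n in counts.items() if n > 1}


# Invented sample comments, for illustration only.
sample = [
    "I support this rule.",
    "i support  this rule.",
    "I oppose this rule.",
    "Please approve it.",
    "I oppose this rule.",
]
groups = exact_duplicate_groups(sample)
group_count = len(groups)                # distinct texts that were repeated
duplicate_total = sum(groups.values())   # comments belonging to those texts

# Arithmetic behind the figures reported in the article.
total, unique = 6650, 615
non_unique = total - unique              # 6,035
unique_share = unique / total            # about 0.09, i.e. 9%
```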

Corey Marshall, director at Splunk4Good, an organization Splunk created to apply its data analysis tools to public information sources, said the non-unique comments are defined as clusters of sentiment that are identical. They are opinions "which can originate independently, but often originate from other sources," said Marshall.

The writers in this category may be responding to a campaign urging them to voice concern about, or support for, the new rule, and may be using these sources to help write their comments.

"They may not be cutting and pasting identically the entire passage of a message but they may be picking and choosing the things that fit their beliefs," said Marshall.
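
Comments that borrow passages without copying them whole will not match exactly, but they can still be flagged by measuring text overlap. The sketch below uses Jaccard similarity over three-word shingles, a common near-duplicate technique. The article does not say how Splunk forms its clusters, so the technique, the function names and the shingle size are all assumptions here, and the sample sentences are invented.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Share of shingles two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Invented examples: a campaign template, a comment that borrows
# from it, and an unrelated comment.
template = "I support this rule because it helps families stay together"
borrowed = "I support this rule because it helps my family"
unrelated = "The weather in Seattle is rainy today"

sim_borrowed = jaccard(template, borrowed)
sim_unrelated = jaccard(template, unrelated)
```

A threshold on the similarity value (a tuning choice, not something stated in the article) would then decide which comments count as part of the same cluster.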

Splunk then applied a Naive Bayes classifier, a data mining technique for categorizing text, to determine sentiment on a scale of one to 10. The higher the score, the more positive the sentiment. In this case, the sentiment score for the H-1B regulation was about 6.2.
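
For readers curious how such a classifier works, here is a minimal Python sketch of a two-class Naive Bayes text classifier with add-one smoothing. It is a toy built on stated assumptions, not Splunk's implementation: the class name, the tiny invented training set, and the mapping of the positive-class probability onto a 1-to-10 scale are all hypothetical, since the article does not say how the score is derived.

```python
import math
from collections import Counter

PUNCT = ".,!?;:\"'()"


def tokenize(text):
    """Split on whitespace, strip punctuation, lowercase."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


class NaiveBayesSentiment:
    """Two-class multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"pos": Counter(), "neg": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, labeled):
        # Assumes at least one example per class is supplied.
        for text, label in labeled:
            self.doc_counts[label] += 1
            for w in tokenize(text):
                self.word_counts[label][w] += 1
                self.vocab.add(w)

    def prob_positive(self, text):
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("pos", "neg"):
            log_p = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in tokenize(text):
                if w in self.vocab:  # ignore words never seen in training
                    log_p += math.log((self.word_counts[label][w] + 1) / denom)
            log_scores[label] = log_p
        # Normalize in log space to avoid underflow.
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        return exp["pos"] / (exp["pos"] + exp["neg"])

    def score(self, text):
        """Hypothetical mapping of P(positive) onto a 1-to-10 scale."""
        return 1 + 9 * self.prob_positive(text)


# Invented training data, for illustration only.
clf = NaiveBayesSentiment()
clf.train([
    ("I support this rule and it will help families", "pos"),
    ("Great change, I strongly support it", "pos"),
    ("I oppose this rule, it will hurt workers", "neg"),
    ("Terrible idea, I strongly oppose it", "neg"),
])
favorable = clf.score("I support this great change")
unfavorable = clf.score("I oppose this terrible idea")
```

Averaging such per-comment scores across a docket would give a single figure comparable in spirit to the 6.2 cited above, though a real system would need far more training data than this toy example.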

Marshall said the sentiment score can be construed as indicating who is in favor of something, but he cautioned, as a general rule, against that conclusion and said the score can't be viewed as conclusive. He does point out that, of all the agencies, the National Park Service gets the highest sentiment scores because people are generally pleased with that agency.

Marshall said eRegulations Insights gives people the "tone of the conversation" and "whether or not there are large clusters of political action taking place."

The sentiment analysis is not yet available for use on all regulations, including the H-1B rule, but Splunk made it available for this story.

The Splunk4Good effort has produced a number of data analysis tools, including ones that look at federal election spending.