Wednesday, February 15, 2012

Mining Gold from Big Data with Text Analytics

Sunday’s New York Times featured a news analysis article about the age of big data and how that means more analysis and technologies are being applied to domains which formerly seemed removed from data crunching—political science, sports, advertising, public health, and more. Technology reporter Steve Lohr highlights “a drift toward data-driven discovery and decision-making.”

Although the article emphasizes number crunching, at Basis Technology we’ve seen this same “drift” across a number of industries as companies attempt to extract value from massive amounts of unstructured text data. We consider this trend to be a validation of our approach to text analytics. Our earliest product customers were web search engines like Lycos, Google, and Bing—the first online technologies to encounter “big data”—and we now work with companies monitoring tweets, blogs and other social media. These companies, and the government agencies we also work with, all deal with the problem of slicing and dicing oceans of text data to find useful tidbits (search engines and compliance) or to come to an aggregate understanding of the whole data set (business intelligence and social media monitoring).

Recently, we’ve seen businesses sit up and take notice of one tool in particular, entity extraction: the automatic extraction of people, places, organizations and other “significant” categories from text. This text analytics tool has been around a long time, but only now are we seeing a broad range of industries adopt it. In social media analysis, entities are mapped to sentiment (think entities like “Dunkin Donuts” being linked to social media comments). In government intelligence, entity extraction may reveal trends and patterns based on the rise and ebb of entities. Publishers may use entities found in unstructured text to link disparate data sources via common entities.

Lohr’s article quotes a January report by the World Economic Forum in Davos, Switzerland, which “declared data a new class of economic asset, like currency or gold.” As we’ve seen with our customers, though, having big data without an automated way to get through it all is like possessing a vein of gold embedded in a mountain. Text analytics is making it possible to aggregate and annotate existing information, link between information repositories, and provide a comprehensive view of the data to the end user.

This point was also highlighted by Andrew Jordan, the CTO and COO of the Accelus division of Thomson Reuters in an interview with the BBC, where he describes their “Content Marketplace” initiative which enables better access to data across their organization. Jordan describes this initiative as “[creating] something bigger than the sum of the parts.” We think that’s a good summary of the value of big data technologies.

Tuesday, January 31, 2012

Thoughts about Technology for E-Discovery

As the Director of Product Management at Basis Technology, I work with our sales and marketing departments on a regular basis. In fact, I meet with our existing or potential customers on a weekly basis. I find these meetings very interesting, and they play a central role in informing our product strategy. Sometimes, however, I find myself thinking of rather unique or complex ways that some of these companies could combine our existing technology or products. In some cases, my concepts are easy to express, and in other cases my ideas are too difficult to explain in a brief meeting that has other goals. I’ll use blog posts like this one to share some of my thoughts that fall into the latter category.

For my first post, I would like to share some of my thoughts about technology that I think could benefit the e-discovery market. To date, the law firms and enterprises using e-discovery products have made big strides in adopting technology to assist in improving the efficiency—and reducing the costs—of running the e-discovery process.

While the e-discovery process is complex (Early Case Assessment, Electronic Discovery Reference Model), as a casual observer, I have noticed that a large fraction of the actual time is spent on one of two general tasks:

  • Finding documents that may be of interest
  • Reviewing the documents of interest

The process of reviewing the documents is the single most expensive step, and this is where most of the vendors have focused their technology. I believe that there is additional technology that can be used to further streamline both of these steps.

Here are some technology areas with obvious and non-obvious applicability to e-discovery.

Language Identification - Single and Mixed
Many of the major vendors are using this technology already. It is quite useful to survey the set of languages in the collected documents, so that you can better estimate the potential costs and challenges of hiring language resources.

As a side note, many of our e-discovery customers have found that identifying only one “dominant” language isn’t enough. Many corporations have lengthy email disclaimers that are automatically attached to each email. If the body of your email is short, and is in a different language from the disclaimer, returning a single language isn’t accurate. To address this, our Rosette® Language Identifier has a feature we call the “language boundary locator,” which looks inside a document to identify different language regions in a text.
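To make the disclaimer problem concrete, here is a minimal sketch in Python. It is not how Rosette works: the stopword lists, the paragraph-level granularity, and the sample email are all assumptions made up for illustration. It only shows why a single whole-document answer can hide a short message body written in another language.

```python
import re

# Tiny illustrative stopword lists (an assumption for this sketch);
# a real language identifier relies on far richer statistical models.
STOPWORDS = {
    "en": {"the", "and", "is", "this", "of", "to", "for", "you", "are", "not"},
    "es": {"el", "la", "los", "que", "de", "por", "para", "una", "gracias", "adjunto"},
}

def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def language_regions(document):
    """Label each blank-line-separated paragraph with a guessed language."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    return [(guess_language(p), p) for p in paragraphs]

# Hypothetical email: a short Spanish body followed by a long English disclaimer.
email = (
    "Gracias por la propuesta, adjunto el contrato para la firma.\n\n"
    "This message is intended only for the use of the addressee and "
    "is not to be forwarded. If you are not the intended recipient, "
    "notify the sender."
)

dominant = guess_language(email)    # one answer for the whole document
regions = language_regions(email)   # one answer per region
for lang, paragraph in regions:
    print(lang, "|", paragraph[:40])
```

In this toy case the whole-document guess follows the longer disclaimer, while the per-region pass still reports the language of the short body.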

Term Expansion
One small, but important, part of the Meet and Confer process is for the parties to agree on a set of keywords that will be used to retrieve the documents for review. Today, this is largely done manually—lawyers build the set of keywords from memory, attempting to cover common variations by manually expanding their query, or using wildcards. While this process can be effective, it is time consuming, and can easily return unrelated documents—or worse, it could miss related documents that used different inflected or conjugated forms of the keyword.

By using term expansion, a new feature of Rosette Base Linguistics, it is easy to see all of the inflected forms of a particular keyword. For example, a keyword of “child” should probably include “children,” and a keyword of “steal” should probably include the terms “stole,” “stolen,” “stealing,” and “steals.” While there are many examples of this in English, the problem is significantly worse in other languages. The Spanish word “pasaportar” has more than 50 inflected forms, and I wouldn’t want to try listing them all from memory. The most difficult languages for keyword search are Chinese, Japanese, and Korean. These languages are written without spaces, so just finding the words is a challenge that must be overcome.
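As a rough illustration of what term expansion buys you, the Python sketch below turns a keyword list into a boolean query that covers the inflected forms mentioned above. The lookup table, function names, and query syntax are hypothetical; a real system would derive the forms from a morphological lexicon rather than a hard-coded dictionary.

```python
# Hypothetical hand-built inflection table, limited to the examples in the
# text; a real system would generate these forms from a lexicon.
INFLECTIONS = {
    "child": ["child", "children"],
    "steal": ["steal", "steals", "stole", "stolen", "stealing"],
}

def expand_terms(keywords):
    """Map each keyword to its known inflected forms (or itself if unknown)."""
    return {kw: INFLECTIONS.get(kw.lower(), [kw]) for kw in keywords}

def to_boolean_query(keywords):
    """Build one OR-group per keyword and AND the groups together."""
    groups = []
    for forms in expand_terms(keywords).values():
        groups.append("(" + " OR ".join(forms) + ")")
    return " AND ".join(groups)

query = to_boolean_query(["child", "steal"])
print(query)
```

Unknown keywords pass through unchanged, so the expanded query is never narrower than the one the lawyers agreed on.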

Entity Extraction + Name Searching
Often, entities (persons, organizations, and locations) are prominent actors in legal matters. One challenge is that these entities are often mentioned in documents, emails, and other pieces of unstructured text. These references are often casual and include nicknames, abbreviations, initials, or partial references. Traditional keyword search techniques are ineffective at handling these variations, and attempts to manually guess the variations are bound to be incomplete.

For example, in the Enron emails, Vincent Kaminski was a prominent figure. Did you know that his name appears in the data in the following ways (with mention count):
  • Vince Kaminski 3167
  • Vince Kaminki 5
  • Vince Kaminiski 40
  • Vincent Kaminski 527
  • Vince Kamainski 2
  • Vince Kamnski 2
  • Vince Kaminsky 106
  • Vince J Kaminski 14332
  • V. Kaminski 42
  • Vince K 28

How did I find these different spellings? Great question.

By combining a powerful statistical entity extractor with a flexible name matching engine, I was able to run a single search for “Vince Kaminski,” and return the results you see above.

At the time I was processing the data, I ran Rosette Entity Extractor on each email and document. The extractor found mentions of persons, locations, and organizations, regardless of spelling, and it did not require any training or tuning. As each entity was identified, I added it to Rosette Name Indexer, which is a flexible, fuzzy search engine for names. Once all the documents were processed and the names indexed, I queried the name index to find out whether other similar names were mentioned anywhere in the set of documents.

This method allows me to interactively expand my query and ensure that my keywords are not missing any of the other references to the people, locations, or organizations that are important to my case.
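The second half of that workflow, querying an index of extracted names for near matches, can be sketched in a few lines of Python. This is only a stand-in: it uses the standard library's generic string similarity (difflib.SequenceMatcher) instead of a purpose-built name matcher, the 0.6 threshold is an arbitrary assumption, and the last two names in the list are illustrative distractors rather than output from any extractor.

```python
from difflib import SequenceMatcher

# Person mentions as an extractor might emit them. The Kaminski variants are
# the spellings listed above; the last two names are illustrative distractors.
extracted_mentions = [
    "Vince Kaminski", "Vince Kaminki", "Vince Kaminiski", "Vincent Kaminski",
    "Vince Kamainski", "Vince Kamnski", "Vince Kaminsky", "Vince J Kaminski",
    "V. Kaminski", "Vince K",
    "Kenneth Lay", "Jeff Skilling",
]

def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def search_names(query, names, threshold=0.6):
    """Return names scoring at or above the threshold, best match first."""
    scored = [(similarity(query, name), name) for name in names]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]

matches = search_names("Vince Kaminski", extracted_mentions)
for name in matches:
    print(name)
```

Plain character similarity handles typos reasonably well, but it knows nothing about nicknames, initials, or name order, which is where a dedicated name matching engine earns its keep.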

To date, I have seen a number of e-discovery technology companies begin to adopt advanced technologies: language identification, near-duplicate detection (which I will discuss in a later blog entry), and even predictive coding. Going forward, I hope to see a growing acceptance of the importance of advanced keyword search techniques, and the value of combining entity extraction with name searching.

Thursday, January 19, 2012

Speaking At the U.S. Department of Defense CyberCrime Conference 2012

Two members of our digital forensics group are presenting at the DOD Cyber Crime Conference next week (1/26-1/27). Brian Carrier, our VP of Digital Forensics, will be giving two talks. One is on the analysis of Chinese knock-off cell phones and the other is on recent advances in the Sleuth Kit and Autopsy tools. Both tools are open source, and Basis Technology has been doing a lot of work on them in the past year. The second talk will cover what is new and what is coming in the next year.

The other talk is by one of our lead examiners, Heather Mahalik. She will be giving a talk on analyzing mobile device backup files that may exist on a hard drive.

I’ve listened to both of them talk at the Basis Technology-sponsored open source digital forensics conferences, and they are both informative and engaging speakers.

Thursday, Jan. 26, 2012

Analyzing the Knockoffs: MediaTek-based Phones
Speaker: Brian Carrier
When: 9:30-10:20am
Where: In the Forensics Track, Centennial Ballroom 1

MediaTek makes inexpensive chipsets for cell phones, and they are commonly seen outside of the U.S. According to press releases, they will be commonly seen in the U.S. in the coming years as they are incorporated into non-smartphones. This talk will cover approaches to logically and physically acquiring the devices. It will also cover what data can be found on the phone and where to find it.

Friday, Jan. 27, 2012

Backup File Forensics
Speaker: Heather Mahalik
When: 10-10:50am on 1/27
Where: In the Forensics Track, Hanover E

Mobile devices are often backed up to hard drives and media cards. These backup files may be encrypted by backup software, such as iTunes and BlackBerry Desktop Manager, making examination more difficult. Commercial forensic tools and software are available for decrypting, analyzing and creating backup files from mobile devices.

This presentation will explore the differences in backup files: where they reside and what tools can be used to examine them. We will focus on a variety of backup files, including iOS, Android, and BlackBerry. The content within the different backup files will be addressed in the presentation. Decryption, analytical, and parsing tools will be shown to provide further understanding to examiners in the audience.

We will discuss the ability to acquire mobile devices using a backup tool, which in some cases may be the only way to pull data from a device. Examples will include a comparison of a backup file from a mobile device to a logically acquired device to determine the differences in content.

Friday, Jan. 27, 2012

Advances in The Sleuth Kit and Autopsy 3 Open Source Forensics Tools
Speaker: Brian Carrier
When: 11-11:50am on 1/27
Where: In the Forensics Track, Inman

This talk will cover new features and functionality of The Sleuth Kit (TSK) and its related open source tools. In the past year, we’ve had TSK releases dealing with new application-level frameworks and adding robustness. We’ve also released a prototype-level framework that allows the user to perform forensics in “The Cloud” using Apache Hadoop and TSK. Autopsy 3.0, a graphical interface to TSK, was released as a new rewrite that allows for more powerful and efficient analysis. This talk will cover its basic features.


Tuesday, January 17, 2012

Find Better Data and Call Me in the Morning

Basis Technology builds and sells an entity extraction module, Rosette® Entity Extractor (REX). REX uses a combination of machine learning and rules to identify entity mentions in text. People have been using machine learning for this task for quite a long time, both academically and commercially. If you read about the problem, you’re likely to find yourself reading about models and gradient searches and such. What you may not read so much about is the travails of preparing data to train with in the first place.

It's been a humbling experience to realize that engineering skills alone will not make for a great entity extractor. REX uses a combination of methods, but it is the machine-learned statistical model that finds mentions of people, places, and organizations in over a dozen languages. In the years I've worked on our entity extractor, much more time has gone into the quality of our data than has gone into building the natural language processing (NLP) technology itself. If you think that quality data sounds simple, you'll be surprised to learn just how complex we've found it to be.

It's no secret that technologies that “learn” from data are sensitive to their training data. To begin with, the system can only learn what you show it. If you train a system on well-formed news articles, you can hardly expect it to work very well on tweets. In the trade, we call this “matching the domain of the training data.” A much stickier problem, however, is that machine-learning systems can be surprisingly sensitive to small details of the training inputs. For example, if you scrape news articles from the web, you're likely to pick up some repetitive text. Many technologies are purposely sensitive to repetition because they're trying to learn from patterns.

I ran into a “bug” once where I realized that a prototype was incorrectly labeling words following the word "and" simply because every document in the training corpus had the same last sentence in it—a boilerplate part of the page it came from. That noisy sentence “taught” the system something funny about “and.”
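A simple sanity check would have caught that problem early. The Python sketch below flags any sentence that shows up in most documents of a corpus so it can be stripped before training. The sentence splitter is deliberately naive, and the 80% cutoff and the three-document corpus are assumptions invented for the example.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def find_boilerplate(documents, min_fraction=0.8):
    """Return sentences that appear in at least min_fraction of the documents."""
    doc_counts = Counter()
    for doc in documents:
        doc_counts.update(set(split_sentences(doc)))  # count once per document
    cutoff = min_fraction * len(documents)
    return {sentence for sentence, n in doc_counts.items() if n >= cutoff}

def strip_boilerplate(documents, boilerplate):
    """Rebuild each document without the flagged sentences."""
    return [
        " ".join(s for s in split_sentences(doc) if s not in boilerplate)
        for doc in documents
    ]

# Hypothetical scraped corpus in which every page ends with the same footer.
corpus = [
    "Shares rose on Monday. Terms and conditions apply.",
    "The mayor visited Boston. Terms and conditions apply.",
    "Rain is expected. Terms and conditions apply.",
]

boilerplate = find_boilerplate(corpus)
cleaned = strip_boilerplate(corpus, boilerplate)
print(boilerplate)
print(cleaned)
```

Exact-match counting only catches verbatim repeats; boilerplate that varies slightly from page to page would need near-duplicate detection instead.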

For our customers, idiosyncratic data is only half of the problem. The product needs to work outside the bubble of a clean corpus and evaluation data set. Very few actual use cases call for analyzing just news articles. We see customers needing to analyze everything from formal manuals to social media to transcriptions of spoken word. Then factor in the need to use the same extraction model over time (as prose, topics, and entity mentions change) and the quality of data quickly becomes quite a headache. Even if we didn’t have to worry about small data-quality problems ballooning into cranky models, we’d still be running as fast as we can just to maintain training sets that apply to our customers’ needs.

For years we've been developing solutions to all of these problems. Initially we made important changes to our learning technology to make it much more robust to obscurities in our text and much more relevant for future use. We used a combination of supervised and unsupervised learning methods to increase overall accuracy. We used active learning to find better data that we could learn more from. We built tools to check our data against itself for accuracy—to achieve inter-annotator agreement. And we've tested the end product over and over on countless types of inputs to make sure it's performing well.
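One common way to put a number on inter-annotator agreement is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The Python sketch below is an illustration under that assumption; I am not claiming it is the measure our tools use, and the sentence and labels are made up for the example.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up token labels for "Vince Kaminski joined Enron in Houston".
annotator_1 = ["PER", "PER", "O", "ORG", "O", "LOC"]
annotator_2 = ["PER", "PER", "O", "ORG", "O", "O"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

Here the annotators agree on five of six tokens, and the chance-corrected score comes out lower than the raw 83% because some of that agreement is expected from label frequencies alone.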

Recently we've been doing our best to identify our customers’ new needs and expand REX. As more and more companies realize the importance of NLP, we're getting new and more difficult demands. However, the theme of the experience remains the same—good data is the most important piece of the puzzle.

Language and Diversity

Welcome to Basis Technology's blog. Our company started 17 years ago with a focus on two things, human language and software, and we've been busy ever since bringing the two together. Along the way we've added other skills, such as our digital forensics practice. One of the great things, and greatest challenges, about human language is the diversity it shows: there are many languages around the world (we have people who speak 17 of them), and even within a language there are many ways to represent the same information. In this blog we'll be discussing many of these linguistic issues, as well as the technology approaches we use to make sense of this information.

Our first content post (after this introduction) will be about Entity Extraction—the process of finding entities (references to things in the real world, like people and places) in written text. Entity extractors are used today for applications such as sentiment analysis, fact and concept extraction, e-discovery, social media monitoring, and government DOCEX (document exploitation). The interesting question is how do you get dependable, high-quality named entity recognition? Since our solution "learns" patterns from human-annotated text, part of the answer lies in the quality of the training data. Good training data is both high quality and diverse. We'll discuss what that means next.