LegalTech is an “industry conference” where attorneys, lawyers, and IT people meet up and discuss the current and future state of law and IT. Product vendors show their software and tools aimed at making the life of the modern-day attorney easier. As I work on semantic search in eDiscovery, my reasons to attend (being generously invited by Jason Baron) were:

To get a better overview and understanding of eDiscovery (in the US).

To see what people consider the ‘future’ or important topics within eDiscovery.

To understand what the current state of the art is in tools and applications.

(To plug semantic search)

Indeed, in summary, to retrieve information! (As an IR researcher does.) The conference included keynotes, conference tracks, panel discussions, and a huge exhibitor show where over 100 vendors of eDiscovery-related software presented their products. All this fit on just three floors of the beautiful Hilton Midtown Hotel in the middle of New York.

Me@LegalTech

LegalTech is a playground for attorneys and lawyers, not so much for PhD students who work on information extraction and semantic search. Needless to say, I was far from the typical attendee (possibly the most atypical). But LegalTech proved to be an informative and valuable crash course in eDiscovery for me (I think I can tick the boxes of all 4 of the aforementioned reasons for attending).

The keynotes allowed me to get a better understanding of eDiscovery (among other things, through hearing some of the founders of the eDiscovery world speak), the panel discussions were very useful for understanding the open problems, challenges, and future directions, and finally the trade show gave me a comprehensive overview of what is being built and used right now in terms of eDiscovery-supporting software.

I had varying success talking to vendors about the things I was interested in: the technology and algorithms behind the tools, and the choices for including or excluding certain features and functionality. More often than not, an innocently nerdy question on my part would be turned into a software sales pitch. To be fair, these people were there to sell, or at least show, so this is hardly unexpected.

The tracks: my observations

During the different tracks and panel discussions I attended, I noticed a couple of things. This is by no means a complete overview of the things that currently matter in eDiscovery, but simply a personal report of what I found interesting or noteworthy:

Some of the recurring “open door” themes revolved around the “man vs. machine” debate, trust in algorithms, the balance between computer-assisted review and manual review, the intricacies of measuring algorithm performance, and where Moore’s law will take the legal world in 5–10 years. These are highly relevant issues for attorneys, lawyers, and eDiscovery vendors, but things that I take for granted and consider the starting point (default win for algorithms!). However, this debate does not yet appear settled in this domain: while everyone accepts computer-assisted review as the unavoidable future, it remains unclear what exactly that unavoidable future will look like.

On multiple occasions I heard video and image retrieval mentioned as important future directions for eDiscovery (good news for some colleagues at the University of Amsterdam down the hall). The challenge of privacy and data ownership in a mobile world, where enterprise and personal data are mixed and spread out across iPads, smartphones, laptops, and clouds, was also identified as a major future hurdle.

Finally, in the session titled “Have we Reached a ‘John Henry’ Moment in Evidentiary Search?”, the panelists (who included Jason Baron and Ralph Losey) touched upon using eDiscovery tools and algorithms for information governance. Currently, methods are being developed to detect, reconstruct, classify, or find events of interest after the fact. Couldn’t these be used in a predictive setting instead of a retrospective one: learning to predict bad stuff before it happens? Interesting stuff.

The tradeshow: metadata-heavy

What I noticed particularly at the trade show was the large overlap among tools, both in functionality and features and in look and design. But what I found more striking was the heavy focus on metadata. The tools typically use metadata such as timestamps, authors, and document types to let users drill down through a dataset, filtering on time periods, keywords, authors, or a combination of these.
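A minimal sketch of this kind of metadata drill-down, assuming a toy document schema of my own invention (the field names are not any vendor's actual data model):

```python
from datetime import date

# Toy document set; the metadata fields (author, date, type) are hypothetical
# stand-ins for what eDiscovery tools typically index.
docs = [
    {"author": "alice", "date": date(2013, 1, 15), "type": "email",
     "text": "Q4 numbers attached"},
    {"author": "bob", "date": date(2013, 2, 2), "type": "memo",
     "text": "meeting notes"},
    {"author": "alice", "date": date(2012, 12, 1), "type": "email",
     "text": "re: contract draft"},
]

def drill_down(docs, author=None, doctype=None, start=None, end=None, keyword=None):
    """Successively filter on metadata, then fall back to keyword search."""
    for d in docs:
        if author and d["author"] != author:
            continue
        if doctype and d["type"] != doctype:
            continue
        if start and d["date"] < start:
            continue
        if end and d["date"] > end:
            continue
        if keyword and keyword not in d["text"]:
            continue
        yield d

# Drill down: Alice's emails from 2013 onwards.
hits = list(drill_down(docs, author="alice", doctype="email",
                       start=date(2013, 1, 1)))
```

Each filter narrows the set further, which is essentially what the faceted interfaces at the show did, just with prettier dashboards.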

Visualizations aplenty, the most frequent being Google Ngrams-like keyword histograms and networks (graphs) of interactions between people. What was shocking for an IR/IE person like myself is that, typically, once a user is done drilling down to a subset of documents, he is relegated to prehistoric keyword search to explore and understand the content of that set. Oh no!

But for someone who’s spending four years of his life on enabling semantic search in this domain, this isn’t worrying but rather promising! After talking to vendors, I learned that plenty of them are interested in these kinds of features and functionality, so there is definitely room for innovation here. (To be fair, though, whether the target users agree might be another question.)

Highlights

Anyway, this ‘metadata heaviness’ is obviously a gross oversimplification and generalization, and there were definitely some interesting companies that stood out for me. Here’s a small, incomplete, and biased summary:

I had some nice talks with the folks at CatalystSecure, whose senior applied research scientist and former IR academic (Dr. Jeremy Pickens) was the ideal companion to be unashamedly nerdy with, talking about classification performance metrics, challenges in evaluating the “whole package” of the eDiscovery process, and awesome datasets.

RedOwl Analytics does some very impressive stuff with behavioural analytics: they collect statistics for each ‘author’ in their data (such as the number of emails sent and received, ‘time to respond’, and number of times cc’ed) to establish an ‘average baseline’ for a single dataset (enterprise), which they can then use to recognize individuals who deviate from this average. The impressive part was that they were able to map these deviations to behavioural traits (such as the ‘probability of an employee leaving the company’, or, at the other end of the spectrum, identifying the ‘top employees’ who otherwise remain under the radar). How this works under the hood remains a mystery to me, but the type of questions they were able to answer in the demo was impressive.
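Since their actual method is a mystery to me, here is only a generic sketch of the baseline-deviation idea: compute the dataset-wide average per feature and flag authors who are far from it. The numbers and the z-score threshold are made up; this is emphatically not RedOwl's algorithm.

```python
import statistics

# Per-author activity statistics (toy numbers). The features mirror the ones
# mentioned above (emails sent, response time); the method is a plain z-score
# deviation from the dataset average -- NOT RedOwl's actual approach.
stats = {
    "alice":   {"emails_sent": 120, "hours_to_respond": 4.0},
    "bob":     {"emails_sent": 115, "hours_to_respond": 5.0},
    "carol":   {"emails_sent": 130, "hours_to_respond": 3.5},
    "mallory": {"emails_sent": 640, "hours_to_respond": 0.2},  # deviates
}

def outliers(stats, feature, threshold=1.5):
    """Flag authors whose feature is > threshold std devs from the mean.

    With few authors a single big outlier inflates the std dev, so the
    threshold is kept modest here.
    """
    values = [s[feature] for s in stats.values()]
    mean, sd = statistics.mean(values), statistics.pstdev(values)
    return [a for a, s in stats.items()
            if sd > 0 and abs(s[feature] - mean) / sd > threshold]

flagged = outliers(stats, "emails_sent")
```

Mapping such deviations onto behavioural traits (the part that impressed me) would need labeled data on top of this; the sketch only covers the "who deviates from baseline" half.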

Recommind’s CORE platform seems to rely heavily on topic modeling, and was able to infer topics from datasets. In doing so, Recommind shows we can indeed move beyond keyword search in a real product (and outside of academic papers ;-) ). This doesn’t come as a surprise, seeing that Recommind’s CTO, Dr. Jan Puzicha, is of probabilistic latent semantic indexing (/analysis) fame.

What’s next?

As I hinted at before, I’m missing some more content-heavy functionality, e.g., (temporal) entity and relation extraction, identity normalization, and maybe (multi-document) summarization? Conveniently, this is exactly what my group and I are working on! I suppose the eDiscovery world just doesn’t know what it’s missing, yet ;-).

👋 Hello

I am a lead data scientist at FD Mediagroep, where I lead a team of four data scientists on the award-winning BNR SMART Radio and FD SMART Journalism projects. I obtained my PhD in Information Retrieval at ILPS (at the University of Amsterdam) in 2017, under the supervision of Prof. Dr. Maarten de Rijke.