Real-time unstructured data matching and analysis

Fast extraction and analysis

Turn millions of documents into rich structured information in no time. Sajari has processed over 100 million documents and extracted key information such as names, places, skills, email addresses, phone numbers, dates and more.

Information extraction

No more batch processing. Analyse unstructured data in real time, as it is processed.

Powerful matching

Refine results in real time and ‘teach’ the data matching algorithm which results you prefer.

Self-learning

Use human- or algorithm-based feedback to teach Sajari to better understand your data.

What is Unstructured Data?

Unstructured data is any data that is not held in a recognizable structure, such as the rows and columns of a traditional database table or the fields of an object. Extracting actionable information from unstructured data is difficult for most companies because it requires sophisticated language processing algorithms, pattern matching and machine learning to be effective. Sajari offers a solution that does all this and more.

Want to see how fast Sajari can pull out structure from a typically unstructured document? Try the following data extraction example, which supports DOC, DOCX, PDF, RTF and more.

A sample result can be viewed below. Note: if you try the demo yourself, no data is stored. This API is also available for HR companies to extract and match resumes/profiles to jobs.

Feature extraction

Sajari uses many different types of feature extraction to create structure from unstructured text. Some of these techniques are outlined below.

Pattern matches

Pattern matching looks for specific patterns in unstructured text. Email addresses, phone numbers and dates are examples that follow very specific expression patterns. Sajari can very efficiently extract a variety of patterns from unstructured data.
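
As a rough illustration of pattern matching, the sketch below uses Python regular expressions. It is not Sajari's implementation: the three patterns and the function name extract_patterns are assumptions chosen for brevity, and they cover only simple email addresses, NNN-NNN-NNNN style phone numbers and ISO dates.

```python
import re

# Deliberately narrow, illustrative patterns (assumptions, not Sajari's rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[\s.-]\d{3}[\s.-]\d{4}\b")  # e.g. 415-555-0123
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # ISO dates only


def extract_patterns(text):
    """Return the email addresses, phone numbers and ISO dates found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


sample = "Call 415-555-0123 or email jane@example.com before 2015-03-01."
result = extract_patterns(sample)
print(result)
```

Real documents contain far more variation (international phone formats, written-out dates and so on), so a production extractor would need a much broader set of patterns.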

Phrase matches

Phrase matching looks for specific pre-determined phrases in unstructured text. The list of phrases can be added to Sajari in CSV format, so any custom taxonomy can be used. Taxonomies can also be very large: Sajari routinely uses taxonomies on the order of 1 million phrases.
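
A minimal sketch of phrase matching against a CSV taxonomy might look like the following. The two-column "phrase,category" layout, the sample rows and the function names are all assumptions made for illustration; the CSV format Sajari actually accepts is not described here.

```python
import csv
import io
import re

# Hypothetical taxonomy contents: one "phrase,category" pair per row.
TAXONOMY_CSV = """phrase,category
machine learning,skill
python,skill
data scientist,job title
"""


def load_taxonomy(csv_text):
    """Map each lower-cased phrase (as a tuple of tokens) to its category."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {tuple(row["phrase"].lower().split()): row["category"] for row in reader}


def match_phrases(text, taxonomy):
    """Scan the tokens of text, preferring the longest phrase at each position."""
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    longest = max((len(phrase) for phrase in taxonomy), default=0)
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in taxonomy:
                matches.append((" ".join(candidate), taxonomy[candidate]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts at this token
    return matches


taxonomy = load_taxonomy(TAXONOMY_CSV)
tags = match_phrases("Senior Data Scientist with Python and machine learning experience.", taxonomy)
print(tags)
```

Because each candidate is a single dictionary lookup, the cost per token in this sketch depends on the length of the longest phrase rather than on the number of phrases in the taxonomy.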

Phrase matching is incredibly useful for tasks such as automatic tagging of documents. The example above shows how entities such as "skills" and "job titles" can be extracted from resumes and jobs. The extracted entities are not only useful for display purposes; they can also be used as a component in custom matching algorithms. In resume-job matching, for example, the cosine similarity of the "skills" in a resume and in a job description is very useful in predicting a match score.
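
To make the cosine similarity idea concrete, the sketch below treats each list of extracted skills as a bag-of-words count vector. The skill lists and the function name are made up for illustration, and how Sajari weights skills in practice is not specified here.

```python
import math
from collections import Counter


def cosine_similarity(skills_a, skills_b):
    """Cosine similarity between two skill lists, treated as count vectors."""
    va, vb = Counter(skills_a), Counter(skills_b)
    dot = sum(va[skill] * vb[skill] for skill in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty skill list matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical extracted skills for a resume and a job description.
resume_skills = ["python", "sql", "machine learning"]
job_skills = ["python", "sql", "java", "go"]
score = cosine_similarity(resume_skills, job_skills)
print(round(score, 3))
```

Here the two lists share two skills, so the score is 2 / (sqrt(3) * sqrt(4)), roughly 0.577. Identical lists score 1 and lists with nothing in common score 0.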

Proxy phrase matches

Proxies are similar to phrase matches, except that the detected phrase is not added to the document; instead it acts as a "proxy" for a different entity, which is added in its place. An example is "the bay area", which proxies to "San Francisco, USA" with lat=37.7833 and lng=-122.4167. In this case the location can be derived even though the text contains neither a specific location match nor a latitude-longitude pair.
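
One simple way to picture a proxy table is a mapping from trigger phrases to the entities they stand for. The sketch below assumes a plain substring lookup; the Location type, the PROXIES table and resolve_proxies are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    name: str
    lat: float
    lng: float


# Hypothetical proxy table: trigger phrase -> entity to attach to the document.
# Longitudes west of the prime meridian are negative by convention.
PROXIES = {
    "the bay area": Location("San Francisco, USA", 37.7833, -122.4167),
}


def resolve_proxies(text, proxies=PROXIES):
    """Return the entities whose trigger phrase occurs in text."""
    lowered = text.lower()
    return [entity for phrase, entity in proxies.items() if phrase in lowered]


entities = resolve_proxies("Looking for engineering roles in the Bay Area.")
print(entities)
```

The input never mentions San Francisco or any coordinates, yet the returned entity carries both. The trigger phrase itself is not part of the result.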

Machine Learning classifiers

Classifiers are very useful for automatically categorizing input documents into groups. In the above example, "Sales" in the professional summary section is a Naive Bayes classification prediction (from approximately 30 classes) for this input document. Naive Bayes is not the only classifier we use, but it performs very well with unstructured text and as such we use it a lot.
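
For readers unfamiliar with Naive Bayes, the following is a minimal multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is a teaching sketch rather than Sajari's classifier: the class name, the two classes and the four training sentences are all made up, whereas the classifier described above chooses among roughly 30 classes.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word tokens with add-one smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def train(self, tokens, label):
        self.class_docs[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny, made-up training set for illustration only.
nb = NaiveBayes()
nb.train("closed enterprise deals and exceeded sales quota".split(), "Sales")
nb.train("managed key accounts and sales pipeline".split(), "Sales")
nb.train("built backend services in python".split(), "Engineering")
nb.train("designed and deployed distributed systems".split(), "Engineering")

prediction = nb.predict("exceeded quota on enterprise accounts".split())
print(prediction)
```

Working in log space avoids numeric underflow on long documents, and the smoothing term keeps an unseen word such as "on" from zeroing out a class.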

Classifiers are not only great for grouping documents, but they also become incredibly useful for creating match algorithms. Not only can the prediction accuracy be measured, but the contribution of each classifier to the match score can also be measured.