The CloudHub360 Developer Hub

Welcome to the CloudHub360 developer hub. You'll find comprehensive guides and documentation to help you start working with CloudHub360 as quickly as possible, as well as support if you get stuck. Let's jump right in!

Get access token

Body Params

client_id

string

required

The Client ID of an API Client created using the dashboard

client_secret

string

required

The Client Secret of an API Client created using the dashboard

The access_token property is your access token. The expires_in property specifies the number of seconds in which this access token will expire. You should make a request for a new access token at this point, or a little before.
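The renewal timing described above can be sketched as a small helper. This is an illustrative sketch only: the AccessToken class, the function names and the 60-second safety margin are assumptions made for the example, and only the access_token and expires_in property names come from the response described here.

```python
import time
from dataclasses import dataclass


@dataclass
class AccessToken:
    """Holds an access token and the time at which it should be renewed."""
    value: str
    renew_at: float  # seconds since the epoch


def token_from_response(body: dict, now: float, margin_seconds: float = 60.0) -> AccessToken:
    """Build an AccessToken from a parsed token response.

    margin_seconds is an assumed safety margin, not a documented value.
    """
    lifetime = float(body["expires_in"])
    # Never schedule the renewal in the past, even for very short lifetimes.
    renew_after = max(lifetime - margin_seconds, 0.0)
    return AccessToken(value=body["access_token"], renew_at=now + renew_after)


def needs_renewal(token: AccessToken, now: float) -> bool:
    """Return True once the renewal time has been reached."""
    return now >= token.renew_at


# Example: a token that lasts one hour is renewed about a minute early.
token = token_from_response(
    {"access_token": "example-token", "expires_in": 3600}, now=time.time()
)
```

Calling needs_renewal(token, time.time()) before each API request is one simple way to decide when to ask for a new token.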

With every request to the API you should then specify the Authorization header as follows:
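The header example is not reproduced on this page. Since the Authorization header is described elsewhere in this reference as an OAuth 2.0 Bearer Token, the sketch below assumes the standard Bearer scheme. The function name is hypothetical.

```python
def authorization_header(access_token: str) -> dict:
    """Return the header to send with each API request.

    Assumes the standard OAuth 2.0 form: "Bearer " followed by the token.
    """
    return {"Authorization": f"Bearer {access_token}"}


headers = authorization_header("example-token")
```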

Create document

Create a new document and add the supplied file to it. The document can then be classified using a classifier and the document's ID.

POST https://api.waives.io/documents

The request body should contain the binary contents of the document's file.

The newly created document resource is returned, along with a 201 Created status. The document resource includes the document's ID, which should be used with the Classify document endpoint to classify the document.

The Supported File Types article contains details of all file types supported by Waives, and the maximum file size.

Files embedded resource

The document resource contains an embedded files resource which includes details of the file that the document was created from.

file_type: The type of the file as determined by the API by examining the contents of the file. This will have one of the values listed in the table below.

size: The size of the file in bytes.

sha256: The SHA-256 hash of the file contents.

It is best practice to calculate your own values for size, sha256 and file_type (which in most cases will be a static value) of the file you are submitting and compare these to the values in the response in order to ensure that the file was not corrupted during transmission.
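A minimal sketch of that integrity check follows. It assumes the embedded files resource has already been parsed into a dictionary with size and sha256 keys, and that the hash is hex-encoded; neither the exact JSON nesting nor the hash encoding is stated here, so treat both as assumptions. The function names are hypothetical.

```python
import hashlib


def local_file_details(contents: bytes) -> dict:
    """Compute the size and SHA-256 hash of the file about to be uploaded."""
    return {
        "size": len(contents),
        "sha256": hashlib.sha256(contents).hexdigest(),
    }


def matches_response(contents: bytes, files_resource: dict) -> bool:
    """Compare locally computed values with those reported by the API.

    files_resource is assumed to be a dict with "size" and "sha256" keys;
    the hash comparison ignores letter case (hex encoding is assumed).
    """
    local = local_file_details(contents)
    return (
        local["size"] == files_resource["size"]
        and local["sha256"] == str(files_resource["sha256"]).lower()
    )
```

If matches_response returns False, the safest course is to treat the upload as corrupted and submit the file again.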

If you know the type of the file and want to validate that the API concurs, you can set the Content-Type header to the MIME-type of the file as shown in the table below. If the file type does not match then the request will be rejected with a 415 response.

RESPONSES

201 The Document was created

400 There is no file supplied in the body

401 There is no Authorization header or the access token is invalid

403 You have reached your maximum number of simultaneous documents

415 The Content-Type contains an unsupported type or does not match the actual contents of the file

Read (OCR) document

RESPONSES

201 The Document was read. The results are available from the Get Read Results endpoint.

400 No Document ID is specified

401 There is no Authorization header or the access token is invalid

404 The specified Document does not exist

422 The content type of the specified document is not supported for this operation

{
"message": "The content type application/vnd.openxmlformats-officedocument.wordprocessingml.document is not currently supported for reading."
}

Get read (OCR) results

Get the results of a read request, as a searchable PDF or raw OCR text

GET https://api.waives.io/documents/document_id/reads

Path Params

document_id

string

required

The ID of the document, as returned by a request to Create document

Headers

Accept

string

The file format in which you would like the document's read results. Supported types are described below.

Authorization

string

The OAuth 2.0 Bearer Token provided during token exchange

Before you make a request to this endpoint you should make a request to Read (OCR) document; otherwise you will receive a 404 Not Found response.

You must set an Accept header with a value specifying the format in which the OCR results should be returned, as follows:

Searchable PDF, with OCR text embedded: application/pdf

Raw OCR text: text/plain

Waives document format (use this only in conjunction with Waives support): application/vnd.waives.resultformats.read+zip

The results are returned in the body of the response in the format requested.
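As an illustration of choosing the Accept header, the sketch below builds the URL and headers for a request to this endpoint without sending it. The short format keys ("pdf", "text", "waives") and the function name are invented for the example, and the Bearer form of the Authorization header is assumed.

```python
from urllib.parse import quote

# Accept header values for each supported result format.
READ_RESULT_FORMATS = {
    "pdf": "application/pdf",
    "text": "text/plain",
    "waives": "application/vnd.waives.resultformats.read+zip",
}


def read_results_request(document_id: str, result_format: str, access_token: str):
    """Return the (url, headers) pair for a Get read (OCR) results request."""
    if result_format not in READ_RESULT_FORMATS:
        raise ValueError(f"Unsupported result format: {result_format!r}")
    url = f"https://api.waives.io/documents/{quote(document_id, safe='')}/reads"
    headers = {
        "Accept": READ_RESULT_FORMATS[result_format],
        "Authorization": f"Bearer {access_token}",  # assumed Bearer scheme
    }
    return url, headers
```

The returned pair can be passed to whichever HTTP client you use; the response body should be written out as bytes for the PDF and ZIP formats.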

Creating Searchable PDFs from TIFFs or JPEGs

Creation of Searchable PDFs from TIFF, JPEG and JPEG2000 files will be available very soon. If you need this, please get in touch with us via support@waives.io.

RESPONSES

200 The results are available in the format requested and returned in the response body

400 No Document ID is specified

401 There is no Authorization header or the access token is invalid

404 The specified Document does not exist or a Read (OCR) document request has not been made for this Document.

Classify document

Path Params

document_id

string

required

The ID of the document, as returned by a request to Create document

classifier_name

string

required

The name of the classifier to use, as specified when calling Create classifier.

Note that documents created from image files (TIFF, JPEG, JPEG2000) and PDFs that contain only images are automatically read (OCRed) before classification is performed. For small documents this will usually be very quick, but for very large documents you should expect response times of up to tens of seconds.

The classification result contains several properties with different purposes. You should take care to understand these. The Classification results article explains all the properties and how to interpret them.

If you haven't added samples to the classifier

If you use a classifier before you have added samples to it you will get classification results where the document type and document type scores are null.

RESPONSES

200 The Document was classified

400 No Document ID is specified or no Classifier name was specified

401 There is no Authorization header or the access token is invalid

404 The specified Document does not exist or the specified Classifier does not exist

Extract document data

Path Params

The name of the extractor to use, as specified when calling Create extractor.

Headers

Accept

string

The type of response to return (extraction results or a redaction request)

Overview

This endpoint extracts data from the specified document using an extractor. By default it returns details of the extracted data, but it can also be used to obtain a response that can be passed directly to the Get redacted PDF endpoint to get a PDF with all extracted data redacted.

On-demand reading

Note that documents created from image files (TIFF, JPEG, JPEG2000) and PDFs that contain only images are automatically read (OCRed) before extraction is performed. For small documents this will usually be very quick, but for very large documents you should expect response time to be longer.

Extracting invoice data

To extract invoice data from UK invoices you can use the built-in extractor named waives.invoices.gb. For more information, see Extracting invoice data.

Extracted data results

If the Accept header is not set or is application/vnd.waives.resultformats.extractdata+json, then the response contains details of the data extracted from the document.

The field_results section of the response contains the data extracted from the document. This is an array containing one element for each field in the extractor configuration. Each field looks like this:

value: The value as a non-text type (e.g. Decimal or DateTime), if available

rejected: A flag indicating whether the result should be considered potentially invalid

reject_reason: The reason for rejection of the result

areas: A list of areas from which the result originated

proximity_score: A score indicating how well any proximity rules in the configuration for this field have been met (how close this result is, or isn't, to particular content nearby)

match_score: A score indicating how well the text matched the search criteria

text_score: A score indicating the OCR confidence assigned to the actual text that was extracted

The area co-ordinates are relative to the top left of the page and are in points (1/72 inch). The page number is one-based (i.e. the first page of a document is page 1).
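When drawing extracted areas over a rendered page image, the point-based co-ordinates need converting to pixels. The sketch below shows that arithmetic. The function names and the idea of a rendering DPI are assumptions for illustration; only the 1/72 inch unit and the one-based page numbering come from the description above.

```python
POINTS_PER_INCH = 72.0


def points_to_pixels(value_in_points: float, dpi: float) -> int:
    """Convert a length in points (1/72 inch) to pixels at the given DPI."""
    return round(value_in_points / POINTS_PER_INCH * dpi)


def page_index(page_number: int) -> int:
    """Convert a one-based page number to a zero-based list index."""
    if page_number < 1:
        raise ValueError("Page numbers start at 1")
    return page_number - 1


# A co-ordinate 72 points from the top left is one inch in,
# which is 300 pixels on a page rendered at 300 DPI.
left_px = points_to_pixels(72, dpi=300)
```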

Score property values range from 0 to 100, where 100 is a perfect score.

Getting a response in redaction request format

This endpoint can also be used to obtain a response that can be passed directly to the Get redacted PDF endpoint to get a PDF with all extracted data redacted.

If the Accept header is application/vnd.waives.requestformats.redact+json then the response you receive will be a redaction request that will redact all data extracted from the document. You can either send this directly in a request to the Get redacted PDF endpoint or edit it first.

One redaction mark is created for every non-empty result and alternative result for every field.

Each redaction mark is labelled with the extraction field it came from to help you if you want to edit it, for example by removing marks for specific fields.

RESPONSES

200 Data was extracted from the document and results are in the response

400 No document ID is specified or no extractor name was specified

401 There is no Authorization header or the access token is invalid

404 The specified document does not exist or the specified extractor does not exist

Get redacted PDF

Path Params

Body Params

apply_marks

Whether to make the redactions permanent and remove associated text from the PDF. (Default: true)

bookmarks

string

An array of bookmarks to add to the document

Beta endpoint

This endpoint is currently in beta. It is functionally complete, but performance is not yet optimised. You should expect response times on the order of 2000 ms. Redacted PDFs do not preserve the compression of the file the document was created from, so they will be larger than that file.

Supported file types

Redaction is supported for documents created from PDFs or TIFFs.

Redaction request

Adding Marks

The marks property is an array containing one element for each redaction to be made to the document. Each mark object looks like this:

The area co-ordinates are relative to the top left of the page and are in points (1/72 inch). The page number is one-based (i.e. the first page of a document is page 1).

Applying redactions

The apply_marks property controls how redactions are made in the PDF.

If apply_marks is true (the default) then as well as a redaction object being added to the PDF, the image underlying each field area is replaced with a black rectangle and any text in that area is removed. The redaction is permanent and cannot be undone if the PDF is loaded into a PDF editor such as Adobe Acrobat.

If apply_marks is false then a redaction object is added to the PDF but the image and any text in the PDF are left unaltered. The redaction can be reviewed and accepted or deleted in a PDF editor such as Adobe Acrobat. Accepting the redaction in that tool will alter the image and remove the text.

Adding bookmarks

The bookmarks property is an array containing one element for each bookmark to add to the PDF. Each bookmark object looks like this:

The text property specifies the text of the bookmark that will be added. The page_number specifies the page in the document that the bookmark will link to.
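Putting the three properties together, a redaction request body might be assembled as in the sketch below. It assumes the request is a JSON object with marks, apply_marks and bookmarks at the top level, which the property descriptions suggest but do not show in full. The function name is hypothetical, and mark objects are passed through unchanged because their exact shape is not reproduced on this page.

```python
import json


def build_redaction_request(marks, bookmarks=(), apply_marks=True) -> str:
    """Serialise a redaction request body as JSON.

    marks: mark objects, passed through unchanged.
    bookmarks: (text, page_number) pairs; page numbers are one-based.
    apply_marks: True makes the redactions permanent (the default).
    """
    bookmark_objects = []
    for text, page_number in bookmarks:
        if page_number < 1:
            raise ValueError("page_number is one-based")
        bookmark_objects.append({"text": text, "page_number": page_number})
    request = {
        "marks": list(marks),
        "apply_marks": apply_marks,
        "bookmarks": bookmark_objects,
    }
    return json.dumps(request)


# Reviewable (non-permanent) redactions with one bookmark on the first page.
body = build_redaction_request([], bookmarks=[("Summary", 1)], apply_marks=False)
```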

Creating a redaction request based on extraction results

In most cases you will want to redact areas corresponding to the locations of data extracted using the Extract document data endpoint. Rather than building a redaction request manually you can request a response from that endpoint that you can pass straight to this endpoint.

Simply make a request to the Extract document data endpoint, specifying an Accept header with the value application/vnd.waives.requestformats.redact+json. The response you receive will be a redaction request that will redact all data extracted from the document. You can either send this directly in a request to this endpoint or edit it first. Each redaction field is labelled with the extraction field it came from to help you if you want to edit it, removing some fields for example.

PDF Text

The PDF returned in the response will contain any text generated by a read (OCR) operation due to any of the Read, Classify or Extract operations being requested for this document.

RESPONSES

200 The document was redacted and the PDF is in the response body

400 One or more properties in the request was invalid. See the response contents for details.

401 There is no Authorization header or the access token is invalid

404 The specified document does not exist

415 Redaction is not supported for documents created from this document's file type

Create classifier

Create a new classifier from an existing classifier or from scratch. Samples must be added to an empty classifier before it can be used to classify documents.

POST https://api.waives.io/classifiers/classifier_name

Path Params

classifier_name

string

required

The desired name for the classifier

Body Params

file

If you have trained a classifier using Document Studio, include the saved .clf file in the request

Headers

Content-Type

string

The MIME type of the request body. Supported values are application/vnd.waives.classifier+zip and application/octet-stream.

If you do not include a classifier file in your Create Classifier request, an empty classifier will be created with the name you specified. You must add samples to the empty classifier before you can use it for classification. If you try to classify a document using a classifier with no samples added you will get classification results where the document type and document type scores are null. For more information see About Classifiers.
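The two ways of calling this endpoint (with and without a classifier file) can be sketched with the standard library as follows. The request is only constructed, not sent. The function name is hypothetical, the Bearer form of the Authorization header is assumed, and sending the .clf file as the raw request body is an assumption based on the supported Content-Type values.

```python
import urllib.request
from typing import Optional
from urllib.parse import quote


def create_classifier_request(
    classifier_name: str, access_token: str, clf_bytes: Optional[bytes] = None
) -> urllib.request.Request:
    """Build a Create classifier request.

    With clf_bytes=None an empty classifier is requested; otherwise the
    saved .clf file contents are sent as the request body (assumed).
    """
    url = "https://api.waives.io/classifiers/" + quote(classifier_name, safe="")
    headers = {"Authorization": f"Bearer {access_token}"}  # assumed Bearer scheme
    if clf_bytes is not None:
        headers["Content-Type"] = "application/vnd.waives.classifier+zip"
    return urllib.request.Request(url, data=clf_bytes, headers=headers, method="POST")


empty_request = create_classifier_request("invoices", "example-token")
```

Passing the result to urllib.request.urlopen would send it; a 201 response indicates the classifier was created.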

RESPONSES

201 The classifier was created

400 There is already a classifier with the specified name

401 There is no Authorization header or the access token is invalid

Add samples from ZIP file

RESPONSES

200 The samples were added to the Classifier

400 There is no file supplied in the body, the file supplied is not a ZIP file or the contents of the ZIP file are invalid (details of the exact problem are included in the error response).

401 There is no Authorization header or the access token is invalid

404 The specified Classifier does not exist

415 The Content-Type header is missing or invalid. Currently only application/zip is supported.

The request body should contain the binary contents of the sample file and the Content-Type header should be set to the MIME-type of the file.

Correct use of the "retrain" query parameter

Once samples have been added to a classifier, the classifier must be "trained". During this process the classifier analyses the samples and determines the defining characteristics of each document type. Training can only be done when there are samples (that are not empty) of at least two document types.

For optimal performance of requests to this endpoint you should only train once, when all the samples you intend to add have been added. Training multiple times won't hurt but will make requests slower.

The retrain query parameter can be used to control whether training happens after the sample is added.

When starting from a new (empty) classifier you must always set retrain=false for the first samples until you have added samples for at least two document types.

Ideally you should set retrain=false for all except the very last sample you want to add, so the training is performed only once.
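The rules above amount to a simple plan: send retrain=false for every sample except the last, and only train at all once at least two document types are present. A minimal sketch follows. It assumes you know the document type of each sample you are about to add; the function names are hypothetical, and the lowercase true/false query values follow the retrain=false form used above.

```python
from typing import List


def retrain_flags(document_types: List[str]) -> List[bool]:
    """Return the retrain value to use for each sample, in upload order.

    Training is requested only on the final sample, and only if the
    batch covers at least two document types.
    """
    flags = [False] * len(document_types)
    if len(set(document_types)) >= 2:
        flags[-1] = True
    return flags


def retrain_query(flag: bool) -> str:
    """Render the query string fragment for one request."""
    return "retrain=true" if flag else "retrain=false"


plan = retrain_flags(["invoice", "invoice", "receipt"])
queries = [retrain_query(flag) for flag in plan]
```

This sketch only considers the samples in the current batch. If the classifier already holds samples of other document types, training on the last sample may be valid even when the batch itself contains a single type.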

Supported file types

Files of the following file types can be used as samples:

PDFs that contain electronic content

Microsoft Office Word, Excel or PowerPoint documents

Text files

Image files and PDFs without electronic content cannot be used as samples. You should OCR these first and use the resulting documents as samples instead.

Note that you will need to make multiple requests to this endpoint to sufficiently train a classifier. Generally it is easier to use the Add samples from ZIP file endpoint, and you should only use this endpoint if you are tightly integrating training into another system, have very large samples, have a very large number of samples or it is inconvenient to build a ZIP file. The Add samples from ZIP file endpoint is also substantially faster when adding multiple samples.

RESPONSES

200 The sample was added to the Classifier

400 There is no file supplied in the body

401 There is no Authorization header or the access token is invalid

404 The specified Classifier does not exist

415 The Content-Type header is missing, contains an unsupported type, does not match the actual contents of the file, or the file is a PDF that does not include content (details of the exact problem are included in the error response).

Create extractor

Path Params

extractor_name

string

required

The desired name for the extractor

The request body should contain the binary contents of an extractor configuration file.

If you have documents you wish to extract data from, please talk to us and we can either create a configuration for you or help you to install the offline extraction configuration tool and train you to use it.

A number of off-the-shelf configurations for extracting header and item data from invoices are available on request from the CloudHub360 team. Versions tuned for various different countries are available.

RESPONSES

201 The extractor was created

400 There is already an extractor with the specified name

401 There is no Authorization header or the access token is invalid