Document Text Extraction

class datarobot.models.documentai.document.FeaturesWithSamples(model_id, feature_name, document_task)
document_task

Alias for field number 2

feature_name

Alias for field number 1

model_id

Alias for field number 0
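
Because FeaturesWithSamples is a named tuple, each entry can be read by field name or unpacked by position. The following is a minimal sketch that uses a stand-in namedtuple with the same field order as documented above, not the library class itself; the values are hypothetical placeholders.

```python
from collections import namedtuple

# Stand-in with the same field order as documented above (not the library class).
FeaturesWithSamples = namedtuple(
    "FeaturesWithSamples", ["model_id", "feature_name", "document_task"]
)

# Hypothetical placeholder values.
entry = FeaturesWithSamples("model_id1", "feature_name", "document_task")

# Field 0 is model_id, field 1 is feature_name, field 2 is document_task.
assert entry.model_id == entry[0]

# Positional unpacking follows the same order.
model_id, feature_name, document_task = entry
```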

class datarobot.models.documentai.document.DocumentPageFile(document_page_id, project_id=None, height=0, width=0, download_link=None)

Page of a document as an image file.

Attributes:
project_id: str

The identifier of the project which the document page belongs to.

document_page_id: str

The unique identifier for the document page.

height: int

The height of the document thumbnail in pixels.

width: int

The width of the document thumbnail in pixels.

thumbnail_bytes: bytes

Document thumbnail as bytes.

mime_type: str

Mime image type of the document thumbnail.

property thumbnail_bytes: bytes

Document thumbnail as bytes.

Returns:
bytes

Document thumbnail.

property mime_type: str

Mime image type of the document thumbnail. Example: ‘image/png’

Returns:
str

Mime image type of the document thumbnail.
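
Together, thumbnail_bytes and mime_type are enough to write a thumbnail to disk with a matching file suffix. The sketch below shows one way to do that with the standard library only. The helper name save_thumbnail and the placeholder bytes are assumptions for illustration; in practice the two arguments would come from a DocumentPageFile object.

```python
import mimetypes
import tempfile
from pathlib import Path


def save_thumbnail(data: bytes, mime_type: str, directory: str, stem: str) -> Path:
    """Write raw image bytes to disk, choosing the suffix from the MIME type."""
    # Fall back to a generic suffix if the MIME type is not recognized.
    suffix = mimetypes.guess_extension(mime_type) or ".bin"
    path = Path(directory) / f"{stem}{suffix}"
    path.write_bytes(data)
    return path


# Placeholder bytes stand in for a real thumbnail (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    saved = save_thumbnail(b"\x89PNG placeholder", "image/png", tmp, "page-1")
    saved_name = saved.name
    saved_size = saved.stat().st_size
```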

class datarobot.models.documentai.document.DocumentThumbnail(project_id, document_page_id, height=0, width=0, target_value=None)

Thumbnail of document from the project’s dataset.

If Project.stage is datarobot.enums.PROJECT_STAGE.EDA2 and the project is supervised, the target_* attributes of this class have values; otherwise they are all None.

Attributes:
document: Document

The document object.

project_id: str

The identifier of the project which the document thumbnail belongs to.

target_value: str

The target value used for filtering thumbnails.

classmethod list(project_id, feature_name, target_value=None, offset=None, limit=None)

Get document thumbnails from a project.

Parameters:
project_id: str

The identifier of the project which the document thumbnail belongs to.

feature_name: str

The name of the feature that specifies the document type.

target_value: Optional[str], default None

The target value to filter thumbnails.

offset: Optional[int], default None

The number of documents to be skipped.

limit: Optional[int], default None

The number of document thumbnails to return.

Returns:
documents: List[DocumentThumbnail]

A list of DocumentThumbnail objects, each representing a single document.

Return type:

List[DocumentThumbnail]
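
The offset and limit parameters allow a large collection to be read in fixed-size batches. The sketch below illustrates that pattern in general terms. The helper iter_pages and the in-memory fake_list function are hypothetical stand-ins, and stopping on a short batch is an assumption about how the end of the collection is detected.

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_pages(fetch: Callable[[int, int], List[T]], page_size: int) -> Iterator[T]:
    """Yield items by calling fetch(offset, limit) until a short batch arrives."""
    offset = 0
    while True:
        batch = fetch(offset, page_size)
        yield from batch
        if len(batch) < page_size:
            break  # a short (or empty) batch means nothing is left
        offset += page_size


# Stand-in for a server-side collection of seven thumbnails (hypothetical data).
fake_thumbnails = [f"thumb-{i}" for i in range(7)]


def fake_list(offset: int, limit: int) -> List[str]:
    # Skip `offset` items, then return at most `limit` items.
    return fake_thumbnails[offset:offset + limit]


collected = list(iter_pages(fake_list, page_size=3))
```

With the real client, fetch could be a small wrapper that forwards offset and limit to DocumentThumbnail.list for a fixed project_id and feature_name.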

Notes

Actual document thumbnails are not fetched from the server by this method. Instead, the data is loaded lazily when DocumentPageFile object attributes are accessed.
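
Lazy loading of this kind can be pictured as a property that performs the download on first access and caches the result. The following is only an illustrative sketch of that general pattern. LazyPage and fake_fetch are hypothetical names, and nothing here is taken from the library's actual implementation.

```python
from functools import cached_property


class LazyPage:
    """Hypothetical stand-in illustrating lazy loading; not the library class."""

    def __init__(self, document_page_id, fetch):
        self.document_page_id = document_page_id
        self._fetch = fetch  # callable that performs the expensive download

    @cached_property
    def thumbnail_bytes(self) -> bytes:
        # Runs only on first access; the result is then cached on the instance.
        return self._fetch(self.document_page_id)


calls = []


def fake_fetch(page_id):
    # Record each call so the number of downloads can be observed.
    calls.append(page_id)
    return b"fake-image-bytes"


page = LazyPage("page-1", fake_fetch)
assert calls == []             # nothing is fetched at construction time
first = page.thumbnail_bytes   # first access triggers the fetch
second = page.thumbnail_bytes  # second access is served from the cache
```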

Examples

Fetch document thumbnails for the given project_id and feature_name.

from datarobot._experimental.models.documentai.document import DocumentThumbnail

# Fetch five documents from the EDA SAMPLE for the specified project and specific feature
document_thumbs = DocumentThumbnail.list(project_id, feature_name, limit=5)

# Fetch five documents for the specified project with target value filtering
# This option is only available after selecting the project target and starting modeling
target1_thumbs = DocumentThumbnail.list(project_id, feature_name, target_value='target1', limit=5)

Preview the document thumbnail.

from datarobot._experimental.models.documentai.document import DocumentThumbnail
from datarobot.helpers.image_utils import get_image_from_bytes

# Fetch 3 documents
document_thumbs = DocumentThumbnail.list(project_id, feature_name, limit=3)

for doc_thumb in document_thumbs:
    thumbnail = get_image_from_bytes(doc_thumb.document.thumbnail_bytes)
    thumbnail.show()

class datarobot.models.documentai.document.DocumentTextExtractionSample

Stateless class for computing and retrieving Document Text Extraction Samples.

Notes

Document text extraction samples are not fetched from the server at the moment of a function call. Detailed information on the documents, their pages, and the rendered page images is fetched on demand when accessed (lazy loading).

Examples

1) Compute text extraction samples for a specific model, and fetch all existing document text extraction samples for a specific project.

from datarobot._experimental.models.documentai.document import DocumentTextExtractionSample

SPECIFIC_MODEL_ID1 = "model_id1"
SPECIFIC_MODEL_ID2 = "model_id2"
SPECIFIC_PROJECT_ID = "project_id"

# Request computation of document text extraction samples for a specific model.
# By default, the `compute` method waits for the computation to finish before returning.
DocumentTextExtractionSample.compute(SPECIFIC_MODEL_ID1, await_completion=False)
DocumentTextExtractionSample.compute(SPECIFIC_MODEL_ID2)

samples = DocumentTextExtractionSample.list_features_with_samples(SPECIFIC_PROJECT_ID)

2) Fetch document text extraction samples for a specific model_id and feature_name, and display all document sample pages.

from datarobot._experimental.models.documentai.document import DocumentTextExtractionSample
from datarobot.helpers.image_utils import get_image_from_bytes

SPECIFIC_MODEL_ID = "model_id"
SPECIFIC_FEATURE_NAME = "feature_name"

samples = DocumentTextExtractionSample.list_pages(
    model_id=SPECIFIC_MODEL_ID,
    feature_name=SPECIFIC_FEATURE_NAME
)
for sample in samples:
    thumbnail = sample.document_page.thumbnail
    image = get_image_from_bytes(thumbnail.thumbnail_bytes)
    image.show()

3) Fetch document text extraction samples for a specific model_id and feature_name, and display text extraction details for the first page. This example displays the image of the document with bounding boxes around the detected text lines. It also prints all text lines extracted from the page along with their coordinates.

from datarobot._experimental.models.documentai.document import DocumentTextExtractionSample

SPECIFIC_MODEL_ID = "model_id"
SPECIFIC_FEATURE_NAME = "feature_name"

samples = DocumentTextExtractionSample.list_pages(SPECIFIC_MODEL_ID, SPECIFIC_FEATURE_NAME)
# Draw bounding boxes for first document page sample and display related text data.
image = samples[0].get_document_page_with_text_locations()
image.show()
# For each text block represented as bounding box object drawn on original image
# display its coordinates (top, left, bottom, right) and extracted text value
for text_line in samples[0].text_lines:
    print(text_line)

classmethod compute(model_id, await_completion=True, max_wait=600)

Starts computation of document text extraction samples for the model. When await_completion is True, the method waits up to max_wait seconds for the computation to finish and, if it is not complete by then, cancels the request.

Parameters:
model_id: str

The identifier of the project's model for which document text extraction samples are computed.

await_completion: bool

Whether the method waits for the computation to complete before returning. Defaults to True.

max_wait: int (default=600)

The maximum number of seconds to wait for the request to finish before raising an AsyncTimeoutError.

Raises:
ClientError

The server rejected the request due to a client error. An invalid model_id is a common cause.

AsyncFailureError

Raised if any of the responses from the server are unexpected.

AsyncProcessUnsuccessfulError

Raised if the document text extraction sample computation failed or was cancelled.

AsyncTimeoutError

Raised if the document text extraction sample computation did not finish within the specified time limit (max_wait).

Return type:

None
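
The interaction between await_completion and max_wait follows the familiar poll-until-done-or-timeout pattern. The sketch below shows that pattern in isolation and is not the library's implementation. The helper wait_until, its poll_interval parameter, and the fake status checks are hypothetical, and the built-in TimeoutError stands in for AsyncTimeoutError.

```python
import time


def wait_until(is_done, max_wait=600.0, poll_interval=1.0):
    """Poll is_done() until it returns True or max_wait seconds elapse."""
    deadline = time.monotonic() + max_wait
    while not is_done():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not finished within {max_wait} seconds")
        time.sleep(poll_interval)


# A fake job that reports completion on the third status check.
checks = {"count": 0}


def fake_is_done():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(fake_is_done, max_wait=5.0, poll_interval=0.01)

# A job that never finishes hits the time limit instead.
timed_out = False
try:
    wait_until(lambda: False, max_wait=0.05, poll_interval=0.01)
except TimeoutError:
    timed_out = True
```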

classmethod list_features_with_samples(project_id)

Returns a list of feature and model_id pairs that have computed document text extraction samples.

Parameters:
project_id: str

The project ID to retrieve the list of computed samples for.

Returns:
List[FeaturesWithSamples]
Return type:

List[FeaturesWithSamples]
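
The returned entries can be regrouped to see which features have samples for each model, for example before calling list_pages or list_documents for each pair. This is a minimal sketch using a stand-in namedtuple and hypothetical identifiers rather than a live server response.

```python
from collections import defaultdict, namedtuple

# Stand-in for the documented named tuple (not the library class).
FeaturesWithSamples = namedtuple(
    "FeaturesWithSamples", ["model_id", "feature_name", "document_task"]
)

# Hypothetical result of list_features_with_samples for one project.
available = [
    FeaturesWithSamples("model_a", "invoice_pdf", "task_1"),
    FeaturesWithSamples("model_a", "receipt_pdf", "task_1"),
    FeaturesWithSamples("model_b", "invoice_pdf", "task_2"),
]

# Group feature names by model so each (model_id, feature_name) pair
# can later be used to request pages or documents.
features_by_model = defaultdict(list)
for entry in available:
    features_by_model[entry.model_id].append(entry.feature_name)
```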

classmethod list_pages(model_id, feature_name, document_index=None, document_task=None)

Returns a list of document text extraction sample pages.

Parameters:
model_id: str

The model identifier.

feature_name: str

The specific feature name to retrieve.

document_index: Optional[int]

The specific document index to retrieve. Defaults to None.

document_task: Optional[str]

The document blueprint task.

Returns:
List[DocumentTextExtractionSamplePage]
Return type:

List[DocumentTextExtractionSamplePage]

classmethod list_documents(model_id, feature_name)

Returns a list of documents used for text extraction.

Parameters:
model_id: str

The model identifier.

feature_name: str

The feature name.

Returns:
List[DocumentTextExtractionSampleDocument]
Return type:

List[DocumentTextExtractionSampleDocument]

class datarobot.models.documentai.document.DocumentTextExtractionSampleDocument(document_index, feature_name, thumbnail_id, thumbnail_width, thumbnail_height, thumbnail_link, document_task, actual_target_value=None, prediction=None)

Document text extraction source.

Holds data that contains feature and model prediction values, as well as the thumbnail of the document.

Attributes:
document_index: int

The index of the document page sample.

feature_name: str

The name of the feature that the document text extraction sample is related to.

thumbnail_id: str

The document page ID.

thumbnail_width: int

The thumbnail image width.

thumbnail_height: int

The thumbnail image height.

thumbnail_link: str

The thumbnail image download link.

document_task: str

The document blueprint task that the document belongs to.

actual_target_value: Optional[Union[str, int, List[str]]]

The actual target value.

prediction: Optional[PredictionType]

Prediction values and labels.

classmethod list(model_id, feature_name, document_task=None)

List available documents with document text extraction samples.

Parameters:
model_id: str

The identifier for the model.

feature_name: str

The name of the feature.

document_task: Optional[str]

The document blueprint task.

Returns:
List[DocumentTextExtractionSampleDocument]
Return type:

List[DocumentTextExtractionSampleDocument]

class datarobot.models.documentai.document.DocumentTextExtractionSamplePage(page_index, document_index, feature_name, document_page_id, document_page_width, document_page_height, document_page_link, text_lines, document_task, actual_target_value=None, prediction=None)

Document text extraction sample covering one document page.

Holds data about the document page, the recognized text, and the location of the text in the document page.

Attributes:
page_index: int

Index of the page inside the document.

document_index: int

Index of the document inside the dataset.

feature_name: str

The name of the feature that the document text extraction sample belongs to.

document_page_id: str

The document page ID.

document_page_width: int

Document page width.

document_page_height: int

Document page height.

document_page_link: str

Document page link to download the document page image.

text_lines: List[Dict[str, Union[int, str]]]

A list of text lines and their coordinates.

document_task: str

The document blueprint task that the page belongs to.

actual_target_value: Optional[Union[str, int, List[str]]]

Actual target value.

prediction: Optional[PredictionType]

Prediction values and labels.
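
Since text_lines is a plain list of dictionaries, it can be post-processed with ordinary Python. The sketch below sorts lines into reading order and joins their text. The key names (left, top, right, bottom, text) and the sample values are assumptions made for illustration, since the exact keys are not listed here.

```python
from typing import Dict, List, Union

TextLine = Dict[str, Union[int, str]]

# Hypothetical sample data; the key names are an assumption for illustration.
text_lines: List[TextLine] = [
    {"left": 40, "top": 120, "right": 300, "bottom": 140, "text": "Total: 42.00"},
    {"left": 40, "top": 20, "right": 220, "bottom": 44, "text": "INVOICE"},
    {"left": 240, "top": 20, "right": 380, "bottom": 44, "text": "No. 0001"},
]


def in_reading_order(lines: List[TextLine]) -> List[TextLine]:
    """Sort lines top-to-bottom, then left-to-right."""
    return sorted(lines, key=lambda line: (line["top"], line["left"]))


# Join the extracted text of the page, one line per detected text line.
page_text = "\n".join(str(line["text"]) for line in in_reading_order(text_lines))
```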

classmethod list(model_id, feature_name, document_index=None, document_task=None)

Returns a list of document text extraction sample pages.

Parameters:
model_id: str

The model identifier, used to retrieve document text extraction page samples.

feature_name: str

The feature name, used to retrieve document text extraction page samples.

document_index: Optional[int]

The specific document index to retrieve. Defaults to None.

document_task: Optional[str]

Document blueprint task.

Returns:
List[DocumentTextExtractionSamplePage]
Return type:

List[DocumentTextExtractionSamplePage]

get_document_page_with_text_locations(line_color='blue', line_width=3, padding=3)

Returns the document page with bounding boxes drawn around the text lines as a PIL.Image.

Parameters:
line_color: str

The color used to draw a bounding box on the image page. Defaults to blue.

line_width: int

The line width of the bounding boxes that will be drawn. Defaults to 3.

padding: int

The additional space left between the text and the bounding box, measured in pixels. Defaults to 3.

Returns:
Image

Returns a PIL.Image with drawn text-bounding boxes.

Return type:

Image
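
The padding parameter can be understood as growing each text box by a fixed number of pixels before it is drawn. The sketch below shows that arithmetic on its own. The helper padded_box is hypothetical, and clamping the result to the page bounds is an assumption, not documented behavior of the method above.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def padded_box(box: Box, padding: int, page_width: int, page_height: int) -> Box:
    """Grow a box by `padding` pixels on every side, clamped to the page."""
    left, top, right, bottom = box
    return (
        max(left - padding, 0),
        max(top - padding, 0),
        min(right + padding, page_width),
        min(bottom + padding, page_height),
    )


# A box well inside the page grows by 3 pixels on every side.
inner = padded_box((40, 20, 220, 44), padding=3, page_width=600, page_height=800)

# A box near the page border is clamped so it never leaves the page.
edge = padded_box((1, 2, 599, 799), padding=3, page_width=600, page_height=800)
```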