Document Text Extraction

class datarobot.models.documentai.document.FeaturesWithSamples(model_id, feature_name, document_task)

document_task: Alias for field number 2

feature_name: Alias for field number 1

model_id: Alias for field number 0

class datarobot.models.documentai.document.DocumentPageFile(document_page_id, project_id=None, height=0, width=0, download_link=None)

Page of a document as an image file.

Attributes:

project_id: str: The identifier of the project which the document page belongs to.
document_page_id: str: The unique identifier for the document page.
height: int: The height of the document thumbnail in pixels.
width: int: The width of the document thumbnail in pixels.
thumbnail_bytes: bytes: Document thumbnail as bytes.
mime_type: str: Mime image type of the document thumbnail.

property thumbnail_bytes: bytes

Document thumbnail as bytes.

Returns:

bytes: Document thumbnail.

property mime_type: str

Mime image type of the document thumbnail. Example: ‘image/png’

Returns:

str: Mime image type of the document thumbnail.

class datarobot.models.documentai.document.DocumentThumbnail(project_id, document_page_id, height=0, width=0, target_value=None)

Thumbnail of document from the project’s dataset.

If Project.stage is datarobot.enums.PROJECT_STAGE.EDA2 and the project is supervised, the target_* attributes of this class have values; otherwise, they are all None.

Attributes:

document: Document: The document object.
project_id: str: The identifier of the project which the document thumbnail belongs to.
target_value: str: The target value used for filtering thumbnails.

classmethod list(project_id, feature_name, target_value=None, offset=None, limit=None)

Get document thumbnails from a project.

Parameters:

project_id: str: The identifier of the project which the document thumbnail belongs to.
feature_name: str: The name of the feature that specifies the document type.
target_value: Optional[str], default None: The target value to filter thumbnails.
offset: Optional[int], default None: The number of documents to be skipped.
limit: Optional[int], default None: The number of document thumbnails to return.

Returns:

documents: List[DocumentThumbnail]: A list of DocumentThumbnail objects, each representing a single document.

Return type:

List[DocumentThumbnail]

Notes

Actual document thumbnails are not fetched from the server by this method. Instead the data gets loaded lazily when DocumentPageFile object attributes are accessed.
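
The lazy loading described in this note can be pictured as a property that downloads on first access and caches the result. The sketch below is only an illustration of that pattern. The class LazyThumbnail and the fake_download function are hypothetical, and the library's internal implementation may differ.

```python
from typing import Callable, List, Optional


class LazyThumbnail:
    """Hypothetical illustration of lazy loading; not the library's implementation."""

    def __init__(self, fetch: Callable[[], bytes]) -> None:
        self._fetch = fetch
        self._cached: Optional[bytes] = None

    @property
    def thumbnail_bytes(self) -> bytes:
        # Download only on first access, then reuse the cached value.
        if self._cached is None:
            self._cached = self._fetch()
        return self._cached


calls: List[int] = []


def fake_download() -> bytes:
    # Stands in for an HTTP request to a download link.
    calls.append(1)
    return b"fake-image-bytes"


thumb = LazyThumbnail(fake_download)
calls_before_access = len(calls)    # nothing is fetched at construction time
data = thumb.thumbnail_bytes        # first access triggers the fetch
data_again = thumb.thumbnail_bytes  # second access uses the cache
```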

Examples

Fetch document thumbnails for the given project_id and feature_name.

from datarobot.models.documentai.document import DocumentThumbnail

# Fetch five documents from the EDA SAMPLE for the specified project and specific feature
document_thumbs = DocumentThumbnail.list(project_id, feature_name, limit=5)

# Fetch five documents for the specified project with target value filtering
# This option is only available after selecting the project target and starting modeling
target1_thumbs = DocumentThumbnail.list(project_id, feature_name, target_value='target1', limit=5)

Preview the document thumbnail.

from datarobot.models.documentai.document import DocumentThumbnail
from datarobot.helpers.image_utils import get_image_from_bytes

# Fetch 3 documents
document_thumbs = DocumentThumbnail.list(project_id, feature_name, limit=3)

for doc_thumb in document_thumbs:
    thumbnail = get_image_from_bytes(doc_thumb.document.thumbnail_bytes)
    thumbnail.show()

class datarobot.models.documentai.document.DocumentTextExtractionSample

Stateless class for computing and retrieving Document Text Extraction Samples.

Notes

Document text extraction samples are not fetched from the server at the time of the function call. Detailed information on the documents, the pages, and their rendered images is fetched on demand when accessed (lazy loading).

Examples

1) Compute text extraction samples for a specific model, and fetch all existing document text extraction samples for a specific project.

from datarobot.models.documentai.document import DocumentTextExtractionSample

SPECIFIC_MODEL_ID1 = "model_id1"
SPECIFIC_MODEL_ID2 = "model_id2"
SPECIFIC_PROJECT_ID = "project_id"

# Request computation of document text extraction samples for specific models.
# By default, `compute` waits for the computation to finish before returning.
DocumentTextExtractionSample.compute(SPECIFIC_MODEL_ID1, await_completion=False)
DocumentTextExtractionSample.compute(SPECIFIC_MODEL_ID2)

samples = DocumentTextExtractionSample.list_features_with_samples(SPECIFIC_PROJECT_ID)

2) Fetch document text extraction samples for a specific model_id and feature_name, and display all document sample pages.

from datarobot.models.documentai.document import DocumentTextExtractionSample
from datarobot.helpers.image_utils import get_image_from_bytes

SPECIFIC_MODEL_ID = "model_id"
SPECIFIC_FEATURE_NAME = "feature_name"

samples = DocumentTextExtractionSample.list_pages(
    model_id=SPECIFIC_MODEL_ID,
    feature_name=SPECIFIC_FEATURE_NAME
)
for sample in samples:
    thumbnail = sample.document_page.thumbnail
    image = get_image_from_bytes(thumbnail.thumbnail_bytes)
    image.show()

3) Fetch document text extraction samples for a specific model_id and feature_name, and display text extraction details for the first page. This example displays the image of the document page with bounding boxes around the detected text lines. It also prints all text lines extracted from the page along with their coordinates.

from datarobot.models.documentai.document import DocumentTextExtractionSample

SPECIFIC_MODEL_ID = "model_id"
SPECIFIC_FEATURE_NAME = "feature_name"

samples = DocumentTextExtractionSample.list_pages(SPECIFIC_MODEL_ID, SPECIFIC_FEATURE_NAME)
# Draw bounding boxes for first document page sample and display related text data.
image = samples[0].get_document_page_with_text_locations()
image.show()
# For each text line drawn as a bounding box on the original image,
# display its coordinates (top, left, bottom, right) and extracted text value.
for text_line in samples[0].text_lines:
    print(text_line)

classmethod compute(model_id, await_completion=True, max_wait=600)

Starts computation of document text extraction samples for the model and, if successful, returns computed text samples for it. This method allows calculation to continue for a specified time and, if not complete, cancels the request.

Parameters:

model_id: str: The identifier of the project’s model for which document text extraction samples are computed.
await_completion: bool: Whether the method waits for the computation to complete before returning.
max_wait: int (default=600): The maximum number of seconds to wait for the request to finish before raising an AsyncTimeoutError.

Raises:

ClientError: Raised if the server rejects the request due to a client error. A bad model_id is a common cause.
AsyncFailureError: Raised if any of the responses from the server are unexpected.
AsyncProcessUnsuccessfulError: Raised if the document text extraction sample computation failed or was cancelled.
AsyncTimeoutError: Raised if the document text extraction sample computation did not resolve within the specified time limit (max_wait).

Return type:

None
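
The interaction between waiting for completion and max_wait can be sketched as a polling loop with a deadline. This is a generic illustration, not the library's code. The helper wait_until_done, its poll_interval parameter, and the fake status functions are hypothetical, and the built-in TimeoutError stands in for the documented AsyncTimeoutError to keep the example standalone.

```python
import time
from typing import Callable, Dict


def wait_until_done(
    is_done: Callable[[], bool], max_wait: float = 600, poll_interval: float = 1.0
) -> None:
    """Poll is_done until it returns True or max_wait seconds have elapsed."""
    deadline = time.monotonic() + max_wait
    while not is_done():
        if time.monotonic() >= deadline:
            # Stand-in for the documented AsyncTimeoutError.
            raise TimeoutError(f"Not finished after {max_wait} seconds")
        time.sleep(poll_interval)


# Fake job that reports completion on the third status check.
checks: Dict[str, int] = {"count": 0}


def fake_status() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


wait_until_done(fake_status, max_wait=5, poll_interval=0.01)

# A job that never finishes hits the time limit.
try:
    wait_until_done(lambda: False, max_wait=0.05, poll_interval=0.01)
    timed_out = False
except TimeoutError:
    timed_out = True
```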

classmethod list_features_with_samples(project_id)

Returns a list of (feature, model_id) pairs that have computed document text extraction samples.

Parameters:

project_id: str: The project ID to retrieve the list of computed samples for.

Returns:

List[FeaturesWithSamples]

Return type:

List[FeaturesWithSamples]
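
Because each returned entry carries a model_id, a feature_name, and a document_task, the result can be regrouped to see which models have samples for each feature. The sketch below uses plain tuples with hypothetical values in place of real FeaturesWithSamples objects.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical entries shaped like (model_id, feature_name, document_task).
entries: List[Tuple[str, str, str]] = [
    ("model_id1", "invoice_pdf", "task_a"),
    ("model_id2", "invoice_pdf", "task_b"),
    ("model_id1", "contract_pdf", "task_a"),
]

# Collect the model identifiers that have samples for each feature.
models_by_feature: Dict[str, List[str]] = defaultdict(list)
for model_id, feature_name, _document_task in entries:
    models_by_feature[feature_name].append(model_id)
```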

classmethod list_pages(model_id, feature_name, document_index=None, document_task=None)

Returns a list of document text extraction sample pages.

Parameters:

model_id: str: The model identifier.
feature_name: str: The specific feature name to retrieve.
document_index: Optional[int]: The specific document index to retrieve. Defaults to None.
document_task: Optional[str]: The document blueprint task.

Returns:

List[DocumentTextExtractionSamplePage]

Return type:

List[DocumentTextExtractionSamplePage]

classmethod list_documents(model_id, feature_name)

Returns a list of documents used for text extraction.

Parameters:

model_id: str: The model identifier.
feature_name: str: The feature name.

Returns:

List[DocumentTextExtractionSampleDocument]

Return type:

List[DocumentTextExtractionSampleDocument]

class datarobot.models.documentai.document.DocumentTextExtractionSampleDocument(document_index, feature_name, thumbnail_id, thumbnail_width, thumbnail_height, thumbnail_link, document_task, actual_target_value=None, prediction=None)

Document text extraction source.

Holds data that contains feature and model prediction values, as well as the thumbnail of the document.

Attributes:

document_index: int: The index of the document page sample.
feature_name: str: The name of the feature that the document text extraction sample is related to.
thumbnail_id: str: The document page ID.
thumbnail_width: int: The thumbnail image width.
thumbnail_height: int: The thumbnail image height.
thumbnail_link: str: The thumbnail image download link.
document_task: str: The document blueprint task that the document belongs to.
actual_target_value: Optional[Union[str, int, List[str]]]: The actual target value.
prediction: Optional[PredictionType]: Prediction values and labels.

classmethod list(model_id, feature_name, document_task=None)

List available documents with document text extraction samples.

Parameters:

model_id: str: The identifier for the model.
feature_name: str: The name of the feature.
document_task: Optional[str]: The document blueprint task.

Returns:

List[DocumentTextExtractionSampleDocument]

Return type:

List[DocumentTextExtractionSampleDocument]

class datarobot.models.documentai.document.DocumentTextExtractionSamplePage(page_index, document_index, feature_name, document_page_id, document_page_width, document_page_height, document_page_link, text_lines, document_task, actual_target_value=None, prediction=None)

Document text extraction sample covering one document page.

Holds data about the document page, the recognized text, and the location of the text in the document page.

Attributes:

page_index: int: Index of the page inside the document.
document_index: int: Index of the document inside the dataset.
feature_name: str: The name of the feature that the document text extraction sample belongs to.
document_page_id: str: The document page ID.
document_page_width: int: Document page width.
document_page_height: int: Document page height.
document_page_link: str: Document page link to download the document page image.
text_lines: List[Dict[str, Union[int, str]]]: A list of text lines and their coordinates.
document_task: str: The document blueprint task that the page belongs to.
actual_target_value: Optional[Union[str, int, List[str]]]: Actual target value.
prediction: Optional[PredictionType]: Prediction values and labels.
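
Since text_lines is a list of dictionaries holding coordinates and text, it can be post-processed with ordinary Python. The sketch below sorts lines into reading order and joins their text. The key names ("top", "left", "bottom", "right", "text") are assumptions made for illustration and should be checked against the actual data, and the sample values are invented.

```python
from typing import Dict, List, Union

TextLine = Dict[str, Union[int, str]]

# Hypothetical text lines; the key names are assumptions for illustration.
text_lines: List[TextLine] = [
    {"top": 120, "left": 40, "bottom": 140, "right": 300, "text": "Total: 42"},
    {"top": 20, "left": 40, "bottom": 45, "right": 260, "text": "Invoice"},
    {"top": 20, "left": 300, "bottom": 45, "right": 420, "text": "No. 7"},
]


def in_reading_order(lines: List[TextLine]) -> List[TextLine]:
    """Sort lines top to bottom, then left to right."""
    return sorted(lines, key=lambda line: (line["top"], line["left"]))


# Join the extracted text of the sorted lines, one line per row.
page_text = "\n".join(str(line["text"]) for line in in_reading_order(text_lines))
```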

classmethod list(model_id, feature_name, document_index=None, document_task=None)

Returns a list of document text extraction sample pages.

Parameters:

model_id: str: The model identifier, used to retrieve document text extraction page samples.
feature_name: str: The feature name, used to retrieve document text extraction page samples.
document_index: Optional[int]: The specific document index to retrieve. Defaults to None.
document_task: Optional[str]: Document blueprint task.

Returns:

List[DocumentTextExtractionSamplePage]

Return type:

List[DocumentTextExtractionSamplePage]

get_document_page_with_text_locations(line_color='blue', line_width=3, padding=3)

Returns the document page with bounding boxes drawn around the text lines as a PIL.Image.

Parameters:

line_color: str: The color used to draw a bounding box on the image page. Defaults to blue.
line_width: int: The line width of the bounding boxes that will be drawn. Defaults to 3.
padding: int: The additional space left between the text and the bounding box, measured in pixels. Defaults to 3.

Returns:

Image: Returns a PIL.Image with drawn text-bounding boxes.

Return type:

Image
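
The effect of the padding parameter can be illustrated with plain arithmetic: each bounding box grows by the padding on every side. The sketch below also clamps the result to the page bounds, which is an assumption, since the source does not say how the library treats boxes near the page edge. The helper padded_box and the coordinate order (left, top, right, bottom) are hypothetical.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def padded_box(box: Box, padding: int, page_width: int, page_height: int) -> Box:
    """Grow a box by padding pixels on every side, clamped to the page bounds."""
    left, top, right, bottom = box
    return (
        max(left - padding, 0),
        max(top - padding, 0),
        min(right + padding, page_width),
        min(bottom + padding, page_height),
    )


# A box well inside the page grows by 3 pixels in every direction.
inner = padded_box((40, 20, 260, 45), padding=3, page_width=600, page_height=800)

# A box touching the page edge is clamped so it never leaves the page.
edge = padded_box((1, 2, 599, 799), padding=3, page_width=600, page_height=800)
```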