Document Text Extraction
- class datarobot.models.documentai.document.FeaturesWithSamples(model_id, feature_name, document_task)
- document_task
Alias for field number 2
- feature_name
Alias for field number 1
- model_id
Alias for field number 0
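The numbered-field aliases above suggest that an entry behaves like a named tuple. The sketch below uses a local stand-in built with `collections.namedtuple` (it is not imported from the library, and the values are placeholders) to show that fields can be read by name, by position, or by unpacking.

```python
from collections import namedtuple

# Local stand-in that mirrors the documented field order; not the datarobot class.
FeaturesWithSamples = namedtuple(
    "FeaturesWithSamples", ["model_id", "feature_name", "document_task"]
)

# Placeholder values for illustration only.
entry = FeaturesWithSamples("model_id1", "feature_name", "document_task1")

# Access by name and access by position refer to the same data.
same_model = entry.model_id == entry[0]
same_feature = entry.feature_name == entry[1]
same_task = entry.document_task == entry[2]

# Tuple unpacking follows the documented field order.
model_id, feature_name, document_task = entry
print(model_id, feature_name, document_task)
```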
- class datarobot.models.documentai.document.DocumentPageFile(document_page_id, project_id=None, height=0, width=0, download_link=None)
Page of a document as an image file.
- Attributes:
- project_id: str
The identifier of the project which the document page belongs to.
- document_page_id: str
The unique identifier for the document page.
- height: int
The height of the document thumbnail in pixels.
- width: int
The width of the document thumbnail in pixels.
- thumbnail_bytes: bytes
Document thumbnail as bytes.
- mime_type: str
Mime image type of the document thumbnail.
- property thumbnail_bytes: bytes
Document thumbnail as bytes.
- Returns:
- bytes
Document thumbnail.
- property mime_type: str
Mime image type of the document thumbnail. Example: ‘image/png’
- Returns:
- str
Mime image type of the document thumbnail.
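To illustrate how properties such as thumbnail_bytes and mime_type can load data only when first accessed, here is a minimal sketch. `LazyPageImage`, `fake_download`, and the signature-sniffing logic are hypothetical stand-ins written for this example; how the library itself downloads the image or determines the MIME type is not described here.

```python
from typing import Callable, List, Optional


class LazyPageImage:
    """Hypothetical stand-in showing on-demand loading; not the datarobot class."""

    def __init__(self, fetch: Callable[[], bytes]) -> None:
        self._fetch = fetch  # called at most once, on first access
        self._bytes: Optional[bytes] = None

    @property
    def thumbnail_bytes(self) -> bytes:
        # Download on first access, then reuse the cached bytes.
        if self._bytes is None:
            self._bytes = self._fetch()
        return self._bytes

    @property
    def mime_type(self) -> str:
        # Assumption: sniff the type from well-known file signatures.
        data = self.thumbnail_bytes
        if data.startswith(b"\x89PNG\r\n\x1a\n"):
            return "image/png"
        if data.startswith(b"\xff\xd8\xff"):
            return "image/jpeg"
        return "application/octet-stream"


calls: List[int] = []


def fake_download() -> bytes:
    """Pretend download that records how often it was called."""
    calls.append(1)
    return b"\x89PNG\r\n\x1a\n" + b"\x00" * 8


page = LazyPageImage(fake_download)
mime = page.mime_type             # first access triggers the download
size = len(page.thumbnail_bytes)  # served from the cache, no second download
```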
- class datarobot.models.documentai.document.DocumentThumbnail(project_id, document_page_id, height=0, width=0, target_value=None)
Thumbnail of document from the project’s dataset.
If Project.stage is datarobot.enums.PROJECT_STAGE.EDA2 and it is a supervised project, then the target_* attributes of this class will have values; otherwise the values will all be None.
- Attributes:
- document: Document
The document object.
- project_id: str
The identifier of the project which the document thumbnail belongs to.
- target_value: str
The target value used for filtering thumbnails.
- classmethod list(project_id, feature_name, target_value=None, offset=None, limit=None)
Get document thumbnails from a project.
- Parameters:
- project_id: str
The identifier of the project which the document thumbnail belongs to.
- feature_name: str
The name of the feature that specifies the document type.
- target_value: Optional[str], default None
The target value to filter thumbnails.
- offset: Optional[int], default None
The number of documents to be skipped.
- limit: Optional[int], default None
The number of document thumbnails to return.
- Returns:
- documents: List[DocumentThumbnail]
A list of DocumentThumbnail objects, each representing a single document.
- Return type:
List[DocumentThumbnail]
Notes
Actual document thumbnails are not fetched from the server by this method. Instead, the data is loaded lazily when DocumentPageFile object attributes are accessed.
Examples
Fetch document thumbnails for the given project_id and feature_name.

```python
from datarobot._experimental.models.documentai.document import DocumentThumbnail

# Fetch five documents from the EDA SAMPLE for the specified project and specific feature
document_thumbs = DocumentThumbnail.list(project_id, feature_name, limit=5)

# Fetch five documents for the specified project with target value filtering
# This option is only available after selecting the project target and starting modeling
target1_thumbs = DocumentThumbnail.list(project_id, feature_name, target_value='target1', limit=5)
```
Preview the document thumbnail.

```python
from datarobot._experimental.models.documentai.document import DocumentThumbnail
from datarobot.helpers.image_utils import get_image_from_bytes

# Fetch 3 documents
document_thumbs = DocumentThumbnail.list(project_id, feature_name, limit=3)

for doc_thumb in document_thumbs:
    thumbnail = get_image_from_bytes(doc_thumb.document.thumbnail_bytes)
    thumbnail.show()
```
- class datarobot.models.documentai.document.DocumentTextExtractionSample
Stateless class for computing and retrieving Document Text Extraction Samples.
Notes
Document text extraction samples are not fetched from the server at the time of the function call. Detailed information about the documents, their pages, and the rendered page images is fetched on demand when it is accessed (lazy loading).
Examples
1) Compute text extraction samples for a specific model, and fetch all existing document text extraction samples for a specific project.

```python
from datarobot._experimental.models.documentai.document import DocumentTextExtractionSample

SPECIFIC_MODEL_ID1 = "model_id1"
SPECIFIC_MODEL_ID2 = "model_id2"
SPECIFIC_PROJECT_ID = "project_id"

# Order computation of document text extraction samples for specific models.
# By default, the `compute` method waits for the computation to end before returning.
DocumentTextExtractionSample.compute(SPECIFIC_MODEL_ID1, await_completion=False)
DocumentTextExtractionSample.compute(SPECIFIC_MODEL_ID2)

samples = DocumentTextExtractionSample.list_features_with_samples(SPECIFIC_PROJECT_ID)
```
2) Fetch document text extraction samples for a specific model_id and feature_name, and display all document sample pages.

```python
from datarobot._experimental.models.documentai.document import DocumentTextExtractionSample
from datarobot.helpers.image_utils import get_image_from_bytes

SPECIFIC_MODEL_ID = "model_id"
SPECIFIC_FEATURE_NAME = "feature_name"

samples = DocumentTextExtractionSample.list_pages(
    model_id=SPECIFIC_MODEL_ID, feature_name=SPECIFIC_FEATURE_NAME
)

for sample in samples:
    thumbnail = sample.document_page.thumbnail
    image = get_image_from_bytes(thumbnail.thumbnail_bytes)
    image.show()
```
3) Fetch document text extraction samples for a specific model_id and feature_name, and display text extraction details for the first page. This example displays the image of the document with bounding boxes around the detected text lines. It also prints all text lines extracted from the page along with their coordinates.

```python
from datarobot._experimental.models.documentai.document import DocumentTextExtractionSample

SPECIFIC_MODEL_ID = "model_id"
SPECIFIC_FEATURE_NAME = "feature_name"

samples = DocumentTextExtractionSample.list_pages(SPECIFIC_MODEL_ID, SPECIFIC_FEATURE_NAME)

# Draw bounding boxes for first document page sample and display related text data.
image = samples[0].get_document_page_with_text_locations()
image.show()

# For each text block represented as bounding box object drawn on original image
# display its coordinates (top, left, bottom, right) and extracted text value
for text_line in samples[0].text_lines:
    print(text_line)
```
- classmethod compute(model_id, await_completion=True, max_wait=600)
Starts computation of document text extraction samples for the model. When await_completion is True, the method waits up to max_wait seconds for the computation to finish; if it is still not complete by then, the method stops waiting and raises AsyncTimeoutError.
- Parameters:
- model_id: str
The identifier of the model for which to compute document text extraction samples.
- await_completion: bool
Whether the method should wait for the computation to complete before returning.
- max_wait: int (default=600)
The maximum number of seconds to wait for the request to finish before raising an AsyncTimeoutError.
- Raises:
- ClientError
The server rejected the request due to a client error. This is often caused by a bad model_id.
- AsyncFailureError
Raised if any of the responses from the server are unexpected.
- AsyncProcessUnsuccessfulError
Raised if the document text extraction sample computation failed or was cancelled.
- AsyncTimeoutError
Raised if the document text extraction sample computation did not resolve within the specified time limit (max_wait).
- Return type:
None
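The interplay of waiting and max_wait can be pictured as a polling loop with a deadline. The following is a minimal sketch of that idea only: `wait_until_done` and `fake_status` are hypothetical helpers, the built-in `TimeoutError` stands in for the documented AsyncTimeoutError, and nothing here reflects how the client actually polls the server.

```python
import time
from typing import Callable


def wait_until_done(is_done: Callable[[], bool], max_wait: float = 600,
                    poll_interval: float = 1.0) -> float:
    """Poll is_done() until it returns True or max_wait seconds elapse."""
    start = time.monotonic()
    while True:
        if is_done():
            return time.monotonic() - start
        if time.monotonic() - start >= max_wait:
            # The library documents AsyncTimeoutError; TimeoutError is a stand-in.
            raise TimeoutError(f"not finished after {max_wait} seconds")
        time.sleep(poll_interval)


# A fake job that reports completion on the third status check.
checks = {"count": 0}


def fake_status() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


elapsed = wait_until_done(fake_status, max_wait=5, poll_interval=0.01)

# A job that never finishes hits the time limit instead.
try:
    wait_until_done(lambda: False, max_wait=0.05, poll_interval=0.01)
    timed_out = False
except TimeoutError:
    timed_out = True
```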
- classmethod list_features_with_samples(project_id)
Returns a list of feature and model_id pairs that have computed document text extraction samples.
- Parameters:
- project_id: str
The project ID to retrieve the list of computed samples for.
- Returns:
- List[FeaturesWithSamples]
- Return type:
List[FeaturesWithSamples]
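A common follow-up to listing features with samples is to find out which models have samples for each feature. The sketch below groups entries by feature name; it uses a local named-tuple stand-in and made-up values rather than a live call, and `models_by_feature` is a hypothetical helper.

```python
from collections import defaultdict, namedtuple
from typing import Dict, Iterable, List

# Local stand-in mirroring the documented fields; not imported from datarobot.
FeaturesWithSamples = namedtuple(
    "FeaturesWithSamples", ["model_id", "feature_name", "document_task"]
)


def models_by_feature(entries: Iterable[FeaturesWithSamples]) -> Dict[str, List[str]]:
    """Map each feature name to the distinct model IDs that have samples for it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        if entry.model_id not in grouped[entry.feature_name]:
            grouped[entry.feature_name].append(entry.model_id)
    return dict(grouped)


# Made-up entries standing in for the result of list_features_with_samples.
entries = [
    FeaturesWithSamples("model_id1", "contract_scan", "task_a"),
    FeaturesWithSamples("model_id2", "contract_scan", "task_a"),
    FeaturesWithSamples("model_id1", "invoice_scan", "task_b"),
]
grouped = models_by_feature(entries)
print(grouped)
```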
- classmethod list_pages(model_id, feature_name, document_index=None, document_task=None)
Returns a list of document text extraction sample pages.
- Parameters:
- model_id: str
The model identifier.
- feature_name: str
The specific feature name to retrieve.
- document_index: Optional[int]
The specific document index to retrieve. Defaults to None.
- document_task: Optional[str]
The document blueprint task.
- Returns:
- List[DocumentTextExtractionSamplePage]
- Return type:
List[DocumentTextExtractionSamplePage]
- classmethod list_documents(model_id, feature_name)
Returns a list of documents used for text extraction.
- Parameters:
- model_id: str
The model identifier.
- feature_name: str
The feature name.
- Returns:
- List[DocumentTextExtractionSampleDocument]
- Return type:
List[DocumentTextExtractionSampleDocument]
- class datarobot.models.documentai.document.DocumentTextExtractionSampleDocument(document_index, feature_name, thumbnail_id, thumbnail_width, thumbnail_height, thumbnail_link, document_task, actual_target_value=None, prediction=None)
Document text extraction source.
Holds data that contains feature and model prediction values, as well as the thumbnail of the document.
- Attributes:
- document_index: int
The index of the document page sample.
- feature_name: str
The name of the feature that the document text extraction sample is related to.
- thumbnail_id: str
The document page ID.
- thumbnail_width: int
The thumbnail image width.
- thumbnail_height: int
The thumbnail image height.
- thumbnail_link: str
The thumbnail image download link.
- document_task: str
The document blueprint task that the document belongs to.
- actual_target_value: Optional[Union[str, int, List[str]]]
The actual target value.
- prediction: Optional[PredictionType]
Prediction values and labels.
- classmethod list(model_id, feature_name, document_task=None)
List available documents with document text extraction samples.
- Parameters:
- model_id: str
The identifier for the model.
- feature_name: str
The name of the feature.
- document_task: Optional[str]
The document blueprint task.
- Returns:
- List[DocumentTextExtractionSampleDocument]
- Return type:
List[DocumentTextExtractionSampleDocument]
- class datarobot.models.documentai.document.DocumentTextExtractionSamplePage(page_index, document_index, feature_name, document_page_id, document_page_width, document_page_height, document_page_link, text_lines, document_task, actual_target_value=None, prediction=None)
Document text extraction sample covering one document page.
Holds data about the document page, the recognized text, and the location of the text in the document page.
- Attributes:
- page_index: int
Index of the page inside the document.
- document_index: int
Index of the document inside the dataset.
- feature_name: str
The name of the feature that the document text extraction sample belongs to.
- document_page_id: str
The document page ID.
- document_page_width: int
Document page width.
- document_page_height: int
Document page height.
- document_page_link: str
Document page link to download the document page image.
- text_lines: List[Dict[str, Union[int, str]]]
A list of text lines and their coordinates.
- document_task: str
The document blueprint task that the page belongs to.
- actual_target_value: Optional[Union[str, int, List[str]]]
Actual target value.
- prediction: Optional[PredictionType]
Prediction values and labels.
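Since text_lines pairs each recognized line with its coordinates, the lines can be reassembled into plain text in reading order. The sketch below assumes each dictionary has the keys "top", "left", "bottom", "right", and "text"; these key names are an assumption inferred from the coordinate order mentioned in the examples, and the sample data is made up.

```python
from typing import Dict, List, Union

TextLine = Dict[str, Union[int, str]]


def page_text(text_lines: List[TextLine]) -> str:
    """Join line texts top to bottom, then left to right."""
    # Assumed keys: "top" and "left" hold pixel coordinates, "text" the content.
    ordered = sorted(text_lines, key=lambda line: (line["top"], line["left"]))
    return "\n".join(str(line["text"]) for line in ordered)


# Made-up lines, deliberately out of order.
sample_lines: List[TextLine] = [
    {"top": 120, "left": 40, "bottom": 140, "right": 300, "text": "Total due: 42.00"},
    {"top": 20, "left": 40, "bottom": 44, "right": 210, "text": "Invoice 0001"},
    {"top": 60, "left": 40, "bottom": 80, "right": 260, "text": "Billed to: Example Ltd"},
]
text = page_text(sample_lines)
print(text)
```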
- classmethod list(model_id, feature_name, document_index=None, document_task=None)
Returns a list of document text extraction sample pages.
- Parameters:
- model_id: str
The model identifier, used to retrieve document text extraction page samples.
- feature_name: str
The feature name, used to retrieve document text extraction page samples.
- document_index: Optional[int]
The specific document index to retrieve. Defaults to None.
- document_task: Optional[str]
Document blueprint task.
- Returns:
- List[DocumentTextExtractionSamplePage]
- Return type:
List[DocumentTextExtractionSamplePage]
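When document_index is left as None, the returned pages may span several documents, so it can help to regroup them. The following sketch groups pages by document_index and orders them by page_index; `PageStub` is a hypothetical stand-in carrying only those two documented attributes, used here so the snippet runs without a server.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, List


@dataclass
class PageStub:
    """Minimal stand-in with only the two documented index attributes."""
    document_index: int
    page_index: int


def pages_by_document(pages: List[PageStub]) -> Dict[int, List[PageStub]]:
    """Group pages per document, with pages sorted by their index."""
    ordered = sorted(pages, key=lambda p: (p.document_index, p.page_index))
    return {
        doc: list(group)
        for doc, group in groupby(ordered, key=lambda p: p.document_index)
    }


# Made-up pages, deliberately shuffled.
pages = [PageStub(1, 1), PageStub(0, 0), PageStub(1, 0), PageStub(0, 1)]
grouped = pages_by_document(pages)
page_order = {doc: [p.page_index for p in group] for doc, group in grouped.items()}
print(page_order)
```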
- get_document_page_with_text_locations(line_color='blue', line_width=3, padding=3)
Returns the document page with bounding boxes drawn around the text lines as a PIL.Image.
- Parameters:
- line_color: str
The color used to draw a bounding box on the image page. Defaults to blue.
- line_width: int
The line width of the bounding boxes that will be drawn. Defaults to 3.
- padding: int
The additional space left between the text and the bounding box, measured in pixels. Defaults to 3.
- Returns:
- Image
Returns a PIL.Image with drawn text-bounding boxes.
- Return type:
Image
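To make the padding parameter concrete, the sketch below computes the rectangle that would surround one text line once padding is applied. It is an illustration under assumptions, not the library's drawing code: the dictionary keys "top", "left", "bottom", and "right" are assumed names, and clamping the box to the page bounds is a choice made for this example.

```python
from typing import Dict, Tuple, Union


def padded_box(line: Dict[str, Union[int, str]], padding: int,
               page_width: int, page_height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) grown by padding and clamped to the page."""
    left = max(0, int(line["left"]) - padding)
    top = max(0, int(line["top"]) - padding)
    right = min(page_width, int(line["right"]) + padding)
    bottom = min(page_height, int(line["bottom"]) + padding)
    return left, top, right, bottom


# Made-up text line close to the top and right edges of a 300x400 page.
line = {"top": 1, "left": 40, "bottom": 44, "right": 299, "text": "Invoice 0001"}
box = padded_box(line, padding=3, page_width=300, page_height=400)
print(box)
```

The tuple is ordered as (x0, y0, x1, y1), which is the coordinate order that Pillow's `ImageDraw.rectangle` accepts, so a box computed this way could be passed to it directly.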