
HPE Haven OnDemand Text Extraction API Cheat Sheet for Developers


HPE Haven OnDemand provides a native API based on cURL calls, as well as numerous language-specific APIs, providing maximum flexibility for developers. This cheat sheet will cover the native and Python text extraction APIs.



By Matthew Mayo, KDnuggets.


Contents

  1. Overview
  2. Getting Text From Images
  3. Getting Text From Audio and Video
  4. Getting Text From Documents, PDFs, and Archives

1. Overview

 
As outlined in a pair of previous posts, HPE Haven OnDemand is a cloud services platform which simplifies how you can interact with data, allowing it to be transformed into an asset anytime, anywhere. HPE Haven OnDemand provides a collection of over 60 machine learning application programming interfaces (APIs) for interacting with structured and unstructured data in a variety of ways.

KDnuggets and HPE Haven OnDemand have teamed up to bring you a Text Extraction API Cheat Sheet, covering numerous methods for obtaining text from different media for your applications. This cheat sheet specifically covers how to get text from images; audio and video; and documents, PDFs, and archives.

As further processing of text after it is extracted is often desired, the cheat sheet also touches on several additional text-processing APIs.

The HPE Haven OnDemand native APIs are POST and GET method-based, and can easily be embedded in any programming language or invoked independently with cURL calls.  To make development even more flexible, Haven OnDemand also provides a number of programming language-specific APIs, including Android, Go, Node.js, R, Ruby and more.

This cheat sheet will consider both the native APIs, demonstrated by cURL calls, and the Python API. Details regarding installation of all language-specific APIs, including Python, can be found in the applicable GitHub repositories.


Note that using HPE Haven OnDemand requires that one register for a free API key.  Also note that HPE Haven OnDemand provides both synchronous and asynchronous API call functionality; some particular APIs support both call types, while others support only one or the other.  You may want to read more about asynchronous API calls.
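To make the synchronous/asynchronous distinction concrete, here is a minimal Python sketch that assembles native endpoint URLs following the pattern used by the cURL calls in this cheat sheet. The helper names `api_endpoint` and `job_endpoint` are hypothetical and not part of any Haven OnDemand library; the sketch only builds strings and does not contact the service.

```python
# Base URL taken from the cURL examples in this cheat sheet
BASE_URL = "https://api.havenondemand.com/1"


def api_endpoint(api_name, synchronous=True, version="v1"):
    """Build the URL for a native API call (hypothetical helper).

    api_name is the lowercase API identifier as it appears in the
    cURL examples, e.g. 'ocrdocument' or 'recognizespeech'.
    """
    mode = "sync" if synchronous else "async"
    return "{0}/api/{1}/{2}/{3}".format(BASE_URL, mode, api_name, version)


def job_endpoint(kind, job_id):
    """Build the URL for checking an asynchronous job (hypothetical helper).

    kind must be 'status' or 'result'.
    """
    if kind not in ("status", "result"):
        raise ValueError("kind must be 'status' or 'result'")
    return "{0}/job/{1}/{2}".format(BASE_URL, kind, job_id)


print(api_endpoint("ocrdocument"))
print(api_endpoint("recognizespeech", synchronous=False))
print(job_endpoint("status", "YOUR_JOB_ID"))
```

The API key is deliberately left out of these URLs; in the cURL examples it is sent as an `apikey` form field or query parameter.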

The cheat sheet does not cover the basics of HPE Haven OnDemand APIs; for those, you may want to refer to the previous posts mentioned above.


2. Getting Text From Images

 
OCR document

The OCR Document API extracts text from an image that you provide.

Using the native Haven OnDemand API, cURL can be used to issue a POST request to the OnDemand platform:

curl -X POST --form "file=@ocr-test-en.png" --form "apikey=YOUR_API_KEY" https://api.havenondemand.com/1/api/sync/ocrdocument/v1


This would result in the following:

{
  "text_block": [
    {
      "text": "This is a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format.\nThe quick brown dog jumped over the\nlazy fox. The quick brown dog jumped\nover the lazy fox. The quick brown dog\njumped over the lazy fox. The quick\nbrown dog jumped over the lazy fox.",
      "left": 36,
      "top": 92,
      "width": 582,
      "height": 269
    }
  ]
}


To accomplish the same using the Python API:

# Import Haven OnDemand APIs
from havenondemand.hodclient import *

# Initiate Haven OnDemand client
client = HODClient("YOUR_API_KEY", version="v1")

# Parameters to pass with request
params = {'file': 'ocr-test-en.png'}

# Make the API call (POST request; works with both 'url' and 'file')
# HODApps.OCR_DOCUMENT denotes the API to use
# async=False specifies a synchronous API call
response = client.post_request(params, HODApps.OCR_DOCUMENT, async=False)

# Print the returned text from image embedded in dict response
# Also includes other useful data
print response['text_block'][0]['text']


Note above the use of the params dict to pass parameters to the request. The same could be accomplished by building an inline dict inside the API call, but as the number of parameters grows, organizing them separately is cleaner.

As with all responses, additional data is also available, enclosed in a returned dict. In this case we simply extracted the text.
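As an illustration of that additional data, the following sketch works on a copy of the OCR JSON response shown earlier (with the text shortened) rather than on a live call. It assumes, based on the field names alone, that `left`, `top`, `width` and `height` describe a bounding box; `describe_blocks` is a hypothetical helper, not part of the Python API.

```python
import json

# Copy of the OCR response shown earlier, with the text shortened
raw = '''{
  "text_block": [
    {
      "text": "This is a lot of 12 point text to test the\\nocr code",
      "left": 36,
      "top": 92,
      "width": 582,
      "height": 269
    }
  ]
}'''
response = json.loads(raw)


def describe_blocks(response):
    """Return the text and (left, top, right, bottom) box of each text block."""
    blocks = []
    for block in response.get("text_block", []):
        # ASSUMPTION: left/top/width/height form a bounding box
        box = (
            block["left"],
            block["top"],
            block["left"] + block["width"],
            block["top"] + block["height"],
        )
        blocks.append({"text": block["text"], "box": box})
    return blocks


blocks = describe_blocks(response)
for item in blocks:
    print("{0} -> {1!r}".format(item["box"], item["text"]))
```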

What about further processing of the results with the Python API? No problem!

# Perform sentiment analysis on result
sent = client.get_request({'text': response['text_block'][0]['text']}, HODApps.ANALYZE_SENTIMENT, async=False)
print sent['aggregate']['score']

# Returned aggregate score
# ==> -0.568579962053



3. Getting Text From Audio and Video

 
Speech recognition

The Speech Recognition API creates a transcript of the text in an audio or video file. You can then use this output with other Haven OnDemand APIs, such as Concept Extraction or Add to Text Index, to gain further insight and analysis.

Using the native Haven OnDemand API, cURL can be used to issue a POST request to the OnDemand platform:

curl -X POST --form "file=@first-lesson.mp3" --form "apikey=YOUR_API_KEY" https://api.havenondemand.com/1/api/async/recognizespeech/v1


Here, an audio file named "first-lesson.mp3" has been passed to the Speech Recognition API.

The Speech Recognition API is asynchronous only, and so additional API calls are required to check the status and/or get the results of the API call.  The first call, above, returns a job ID, as shown below:

{
    "jobID": "YOUR_JOB_ID"
}


Checking the status of this job is accomplished via the following cURL call:

curl -X POST "https://api.havenondemand.com/1/job/status/YOUR_JOB_ID?apikey=YOUR_API_KEY"


Getting the results of this job, once finished, is accomplished via the following cURL call:

curl -X POST "https://api.havenondemand.com/1/job/result/YOUR_JOB_ID?apikey=YOUR_API_KEY"


Both of these return data in JSON format, which can be captured or further processed as necessary.
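In practice the status call is usually repeated until the job completes. A minimal polling sketch follows. It assumes the status JSON carries a top-level `status` field whose value eventually becomes `finished` or `failed`; that field name and those values are assumptions to verify against a real response. `wait_for_job` and `fake_fetch` are hypothetical names, and the stand-in fetcher lets the example run without network access.

```python
import time


def wait_for_job(fetch_status, job_id, interval=2.0, max_attempts=30,
                 sleep=time.sleep):
    """Call fetch_status(job_id) until the job is reported as finished.

    fetch_status is any callable returning the parsed status JSON as a dict.
    ASSUMPTION: the dict has a 'status' key with values such as
    'finished' or 'failed'.
    """
    for _ in range(max_attempts):
        status = fetch_status(job_id)
        state = status.get("status")
        if state == "finished":
            return status
        if state == "failed":
            raise RuntimeError("job {0} failed".format(job_id))
        sleep(interval)
    raise RuntimeError(
        "job {0} did not finish after {1} checks".format(job_id, max_attempts)
    )


# Stand-in for a real status request, so the sketch runs offline
_replies = iter([
    {"status": "queued"},
    {"status": "in progress"},
    {"status": "finished"},
])


def fake_fetch(job_id):
    return next(_replies)


final = wait_for_job(fake_fetch, "YOUR_JOB_ID", sleep=lambda seconds: None)
print(final["status"])
```

Passing the fetcher in as an argument keeps the loop independent of how the request is made, whether through cURL output, an HTTP library, or a language-specific client.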

Now, let's see this same functionality using the Python API:

# Import Haven OnDemand APIs
from havenondemand.hodclient import *

# Initiate Haven OnDemand client
client = HODClient("YOUR_API_KEY", version="v1")

# Parameters to pass with request
params = {'file': 'obama-victory-speech.mp4'}

# API call (this API is ASYNC only)
response = client.post_request(params, HODApps.RECOGNIZE_SPEECH, async=True)

# Get the job ID for checking asynchronously...
print response['jobID']
jobID = response['jobID']

# Check status (also returns result when finished)
# Preferable to use status over result API to avoid timeout
status = client.get_job_status(jobID)
print status

# Alternatively, check result
result = client.get_job_result(jobID)
print result



4. Getting Text From Documents, PDFs, and Archives

 
File formats

The Text Extraction API extracts metadata and text content from a file that you provide. The API can handle over 500 different file formats. We will use the Text Extraction API to extract text from a PDF document, a Microsoft Word document, and a ZIP archive.

Extracting Text From PDF Files

Using the native Haven OnDemand API, cURL can be used to issue a POST request to the OnDemand platform:

curl -X POST --form "file=@kdnuggets-test.pdf" --form "extract_metadata=true" --form "extract_text=true" --form "extract_xmlattributes=false" --form "apikey=YOUR_API_KEY" https://api.havenondemand.com/1/api/sync/extracttext/v1


This would result in the following (an excerpt):

{
  "document": [
    {
      "reference": "kdnuggets-test.pdf",
      "doc_iod_reference": "4733178f00148f4a2a1b573c6cb2674c",
      "app_name": [
        "Skia/PDF m53"
      ],
      "content_type": [
        "application/pdf"
      ],
      "document_attributes": [
        "0"
      ],
      "document_embedded_font_ratio": [
        "0"
      ],
      "document_pct_embedded_font": [
        "0"
      ],
      "document_pct_probability_mismatch": [
        "0"
      ],
      "file_size": [
        78519
      ],
      "page_count": [
        "2"
      ],
      "processing_error_code": [
        "1"
      ],
      "processing_error_description": [
        "Document: Embedded Font"
      ],
      "content": "Top /r/MachineLearning stories this month:

      . . .

    }
  ],
  "md5sum": [
    "f1b75da0ca2341a57316c9c0a375697a"
  ]
}


Now, let's see this same functionality using the Python API (note that this would only print the content of the file, yet more data is returned):

# Import Haven OnDemand APIs 
from havenondemand.hodclient import *

# Initiate Haven OnDemand client
client = HODClient("YOUR_API_KEY", version="v1")

# Parameters to pass with request
params = {'file': 'kdnuggets-test.pdf'}

# API call
response = client.post_request(params, HODApps.EXTRACT_TEXT, async=False)

# Print only the content of the submitted file
# Also includes other useful data
print response['document'][0]['content']


Extracting Text From Word Documents

Using the native Haven OnDemand API, cURL can be used to issue a POST request to the OnDemand platform:

curl -X POST --form "file=@kdnuggets-test.docx" --form "extract_metadata=true" --form "extract_text=true" --form "extract_xmlattributes=false" --form "apikey=YOUR_API_KEY" https://api.havenondemand.com/1/api/sync/extracttext/v1


This would result in the following (an excerpt):

{
  "document": [
    {
      "reference": "kdnuggets-test.docx",
      "doc_iod_reference": "2d761859357918e41912e19cd4818165",
      "content_type": [
        "application/x-ms-word07"
      ],
      "document_attributes": [
        "0"
      ],
      "file_size": [
        6267
      ],
      "content": "Top /r/MachineLearning stories this month:

      . . .

    }
  ],
  "md5sum": [
    "050a4db634241c7057bfd7539f09366f"
  ]
}


This functionality is achievable via the Python API using the same code excerpt as in the PDF text extraction example above, substituting the appropriate Microsoft Word filename.
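In both the PDF and Word excerpts, most metadata values arrive wrapped in single-element lists. The sketch below unwraps them for easier downstream use. It runs on fields copied from the Word excerpt above rather than a live response, and `flatten_metadata` is a hypothetical helper, not part of the Python API.

```python
def flatten_metadata(document):
    """Unwrap single-element lists so each field maps directly to its value."""
    flat = {}
    for key, value in document.items():
        if isinstance(value, list) and len(value) == 1:
            flat[key] = value[0]
        else:
            # Leave plain values and multi-valued fields untouched
            flat[key] = value
    return flat


# Fields copied from the Word document excerpt above
document = {
    "reference": "kdnuggets-test.docx",
    "doc_iod_reference": "2d761859357918e41912e19cd4818165",
    "content_type": ["application/x-ms-word07"],
    "document_attributes": ["0"],
    "file_size": [6267],
}

meta = flatten_metadata(document)
print("{0}: {1}, {2} bytes".format(
    meta["reference"], meta["content_type"], meta["file_size"]))
```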

Extracting Text From ZIP Archives

Using the native Haven OnDemand API, cURL can be used to issue a POST request to the OnDemand platform:

curl -X POST --form "file=@kdnuggets-test.zip" --form "extract_metadata=true" --form "extract_text=true" --form "extract_xmlattributes=false" --form "apikey=YOUR_API_KEY" https://api.havenondemand.com/1/api/sync/extracttext/v1


This would result in the following (an excerpt):

{
  "document": [
    {
      "reference": "kdnuggets-test.zip",
      "doc_iod_reference": "3573e81cca7695b998cbefc4098b112f",
      "content_type": [
        "application/zip"
      ],
      "document_attributes": [
        "0"
      ],
      "file_size": [
        82375
      ]
    },
    {
      "reference": "kdnuggets-test.zip:kdnuggets-test.docx",
      "parent_iod_reference": "3573e81cca7695b998cbefc4098b112f",
      "doc_iod_reference": "d636d9e2371bfd5ff370059ccae0619f",

      . . .

    },
    {
      "reference": "kdnuggets-test.zip:kdnuggets-test.pdf",
      "parent_iod_reference": "3573e81cca7695b998cbefc4098b112f",
      "doc_iod_reference": "1499d6bfd5621756de8965738b6cf5dc",

      . . .

    }
  ],
  "md5sum": [
    "6c2f26f89baf9a629c45aebc3e4aa5a0"
  ]
}


This functionality is achievable via the Python API using the same code excerpt as in the PDF text extraction example above, substituting the appropriate ZIP archive filename.
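The ZIP response lists the archive and its members as sibling entries, linked through `parent_iod_reference`. A small sketch of regrouping them follows, using only the fields visible in the excerpt above (the elided fields are omitted). `children_by_parent` is a hypothetical helper, and the linking rule is inferred from the excerpt rather than taken from API documentation.

```python
# Entries copied from the ZIP excerpt above; elided fields are left out
documents = [
    {"reference": "kdnuggets-test.zip",
     "doc_iod_reference": "3573e81cca7695b998cbefc4098b112f"},
    {"reference": "kdnuggets-test.zip:kdnuggets-test.docx",
     "parent_iod_reference": "3573e81cca7695b998cbefc4098b112f",
     "doc_iod_reference": "d636d9e2371bfd5ff370059ccae0619f"},
    {"reference": "kdnuggets-test.zip:kdnuggets-test.pdf",
     "parent_iod_reference": "3573e81cca7695b998cbefc4098b112f",
     "doc_iod_reference": "1499d6bfd5621756de8965738b6cf5dc"},
]


def children_by_parent(documents):
    """Group member references under the reference of their parent container."""
    # Look up a readable name for each document ID
    names = {doc["doc_iod_reference"]: doc["reference"] for doc in documents}
    tree = {}
    for doc in documents:
        parent_id = doc.get("parent_iod_reference")
        if parent_id is None:
            continue  # top-level entry, e.g. the archive itself
        parent_name = names.get(parent_id, parent_id)
        tree.setdefault(parent_name, []).append(doc["reference"])
    return tree


tree = children_by_parent(documents)
for parent, members in tree.items():
    print("{0}: {1}".format(parent, ", ".join(members)))
```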

An alternate approach to accessing files within compressed archives would be to first extract the contents of the ZIP file (or any other type of container) using the Expand Container API, which then stores the files for processing by other APIs.

This native API call:

curl -X POST --form "file=@kdnuggets-test.zip" --form "apikey=YOUR_API_KEY" https://api.havenondemand.com/1/api/sync/expandcontainer/v1


would produce the following result:

{
  "files": [
    {
      "name": "kdnuggets-test.docx",
      "reference": "f8af1448-4b9f-406d-b03f-9b035716cf7f-14e78989",
      "size": 6267
    },
    {
      "name": "kdnuggets-test.pdf",
      "reference": "76643f0f-de05-4bb0-9afc-979e96758f8c-14e78989",
      "size": 78519
    }
  ]
}


These extracted files could then be processed in other APIs via their reference.
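As a rough sketch of that hand-off, the snippet below turns the Expand Container result shown above into per-file parameter dicts. It assumes the receiving API accepts a `reference` input parameter in place of `file`; that parameter name is an assumption to confirm for the API in question, and `reference_params` is a hypothetical helper.

```python
# Result copied from the Expand Container example above
expanded = {
    "files": [
        {"name": "kdnuggets-test.docx",
         "reference": "f8af1448-4b9f-406d-b03f-9b035716cf7f-14e78989",
         "size": 6267},
        {"name": "kdnuggets-test.pdf",
         "reference": "76643f0f-de05-4bb0-9afc-979e96758f8c-14e78989",
         "size": 78519},
    ]
}


def reference_params(expanded, suffix=None):
    """Map each extracted file name to a params dict pointing at its reference.

    If suffix is given (e.g. '.pdf'), only matching file names are kept.
    """
    params = {}
    for entry in expanded.get("files", []):
        if suffix is not None and not entry["name"].lower().endswith(suffix):
            continue
        # ASSUMPTION: the target API takes a 'reference' input parameter
        params[entry["name"]] = {"reference": entry["reference"]}
    return params


for name, params in sorted(reference_params(expanded).items()):
    print("{0}: {1}".format(name, params))
```

Each resulting dict could then play the role of `params` in a request like the ones shown earlier, subject to the assumption above.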

HPE Haven OnDemand includes various connector APIs, which allow you to retrieve information from external systems and update it through the available APIs. Supported connector types include the local filesystem, web resources, SharePoint repositories, and Dropbox accounts. The connectors make it easy to incorporate data from across numerous systems into Haven OnDemand projects.


Note: the article was commissioned by HPE, but written by an independent KDnuggets expert.

