Detects the top topics in a group of text documents.

Share:

Description

This function returns the top detected topics for a list of submitted text documents. A topic is identified with a key phrase, which can be one or more related words. At least 100 text documents must be submitted, however this API is designed to detect topics across hundreds to thousands of documents. For best performance, limit each document to a short, human written text paragraph such as review, conversation or user feedback.

English is the only language supported at this time.

You can provide a list of stop words to control which words or documents are filtered out. You can also supply a list of topics to exclude from the response. Finally, you can also provide min/max word frequency count thresholds to exclude rare/ubiquitous document topics.

We recommend using the textaDetectTopics function in synchronous mode, in which case it will return only after topic detection has completed. If you decide to call this function in asynchronous mode, you will need to call the textaDetectTopicsStatus function periodically yourself until the Microsoft Cognitive Services server complete topic detection and results become available.

IMPORTANT NOTE: If you're calling textaDetectTopics in synchronous mode within the R console REPL (interactive mode), it will appear as if the console has hanged. This is EXPECTED. The function hasn't crashed. It is simply in "sleep mode", activating itself periodically and then going back to sleep, until the results have become available. In sleep mode, even though it appears "stuck", textaDetectTopics dodesn't use any CPU resources. While the function is operating in sleep mode, you WILL NOT be able to use the console until the function completes. If need to operate the console while topic detection is being performed by the Microsoft Cognitive services servers, you should call textaDetectTopics in asynchronous mode and then call textaDetectTopicsStatus yourself repeteadly afterwards, until results are available.

Note that one transaction is charged per text document submitted.

Internally, this function invokes the Microsoft Cognitive Services Text Analytics REST API documented at https://www.microsoft.com/cognitive-services/en-us/text-analytics/documentation.

You MUST have a valid Microsoft Cognitive Services account and an API key for this function to work properly. See https://www.microsoft.com/cognitive-services/en-us/pricing for details.

Usage

1
2
3
textaDetectTopics(documents, stopWords = NULL, topicsToExclude = NULL,
  minDocumentsPerWord = NULL, maxDocumentsPerWord = NULL,
  resultsPollInterval = 30L, resultsTimeout = 1200L, verbose = FALSE)

Arguments

documents

(character vector) Vector of sentences or documents on which to perform topic detection. At least 100 text documents must be submitted. English is the only language supported at this time.

stopWords

(character vector) Vector of stop words to ignore while performing topic detection (optional)

topicsToExclude

(character vector) Vector of topics to exclude from the response (optional)

minDocumentsPerWord

(integer) Words that occur in less than this many documents are ignored. Use this parameter to help exclude rare document topics. Omit to let the service choose appropriate value. (optional)

maxDocumentsPerWord

(integer) Words that occur in more than this many documents are ignored. Use this parameter to help exclude ubiquitous document topics. Omit to let the service choose appropriate value. (optional)

resultsPollInterval

(integer) Interval (in seconds) at which this function will query the Microsoft Cognitive Services servers for results (optional, default: 30L). If set to 0L, this function will return immediately and you will have to call textaDetectTopicsStatus periodically to collect results. If set to a non-zero integer value, this function will only return after all results have been collected. It does so by repeatedly calling textaDetectTopicsStatus on its own until topic detection has completed. In the latter case, you do not need to call textaDetectTopicsStatus.

resultsTimeout

(integer) Interval (in seconds) at which point this function will give up and stop querying the Microsoft Cognitive Services servers for results (optional, default: 1200L). As soon as all results are available, this function will return them to the caller. If the Microsoft Cognitive Services servers within resultsTimeout seconds, this function will stop polling the servers and return the most current results.

verbose

(logical) If set to TRUE, print every poll status to stdout.

Value

An S3 object of the class textatopics. The results are stored in the results dataframes inside this object. See textatopics for details. In the synchronous case (i.e., the function only returns after completion), the dataframes contain the documents, the topics, and which topics are assigned to which documents. In the asynchonous case (i.e., the function returns immediately), the dataframes contain the documents, their unique identifiers, their current operation status code, but they don't contain the topics yet, nor their assignments. To get the topics and their assignments, you must call textaDetectTopicsStatus until the Microsoft Services servers have completed topic detection.

Author(s)

Phil Ferriere pferriere@hotmail.com

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
## Not run: 
 load("./data/yelpChineseRestaurantReviews.rda")
 set.seed(1234)
 documents <- sample(yelpChReviews$text, 1000)

 tryCatch({

   # Detect top topics in group of documents
   topics <- textaDetectTopics(
     documents,                  # At least 100 documents (English only)
     stopWords = NULL,           # Stop word list (optional)
     topicsToExclude = NULL,     # Topics to exclude (optional)
     minDocumentsPerWord = NULL, # Threshold to exclude rare topics (optional)
     maxDocumentsPerWord = NULL, # Threshold to exclude ubiquitous topics (optional)
     resultsPollInterval = 30L,  # Poll interval (in s, default:30s, use 0L for async)
     resultsTimeout = 1200L,     # Give up timeout (in s, default: 1200s = 20mn)
     verbose = TRUE              # If set to TRUE, print every poll status to stdout
   )

   # Class and structure of topics
   class(topics)
   #> [1] "textatopics"

   str(topics, max.level = 1)
   #> List of 8
   #> $ status          : chr "Succeeded"
   #> $ operationId     : chr "30334a3e1e28406a80566bb76ff04884"
   #> $ operationType   : chr "topics"
   #> $ documents       :'data.frame':  1000 obs. of  2 variables:
   #> $ topics          :'data.frame':  71 obs. of  3 variables:
   #> $ topicAssignments:'data.frame':  502 obs. of  3 variables:
   #> $ json            : chr "{\"status\":\"Succeeded\",\"createdDateTime\": __truncated__ }
   #> $ request         :List of 7
   #> ..- attr(*, "class")= chr "request"
   #> - attr(*, "class")= chr "textatopics"

   # Print results
   topics
   #> textatopics [https://westus.api.cognitive.microsoft.com/text/analytics/ __truncated__ ]
   #> status: Succeeded
   #> operationId: 30334a3e1e28406a80566bb76ff04884
   #> operationType: topics
   #> topics (first 20):
   #> ------------------------
   #>    keyPhrase      score
   #> ---------------- -------
   #>     portions       35
   #>   noodle soup      30
   #>    vegetables      20
   #>       tofu         19
   #>      garlic        17
   #>     Eggplant       15
   #>       Pad          15
   #>      combo         13
   #> Beef Noodle Soup   13
   #>      House         12
   #>      entree        12
   #>     wontons        12
   #>     Pei Wei        12
   #>  mongolian beef    11
   #>       crab         11
   #>      Panda         11
   #>       bean         10
   #>    dumplings        9
   #>     veggies         9
   #>      decor          9
   #> ------------------------

 }, error = function(err) {

   # Print error
   geterrmessage()

 })

## End(Not run)