
Extracting Metadata from Text Fields#

This guide outlines how to configure the extraction of metadata from text fields on Kolena. Follow the steps below to get started with Automatic Metadata Extraction for Text.

Configuring Metadata Extraction#

1. Navigate to Dataset Details

Open your dataset and scroll down to the "Details" page.

Navigating to Metadata Configuration

2. Select Text Fields and Properties

Identify and select the text fields from your dataset that you want to analyze. Also select the properties of the fields you wish to extract.

In the example below, we extract properties from the best_answer and question fields. For the best_answer field, we display word_count and topic_tag, whereas for the question field we display word_count, readability, and question_type.

Select Specific Properties of Interest For Relevant Fields

3. Edit Metadata Configuration

To make additional metadata visible (or to hide existing metadata), edit the configuration.

This adds or removes metadata properties. The example below shows how to add the character_count property to the best_answer field. The properties shown in purple are the automatically extracted properties.

Example of adding character_count to the list of extracted properties

Example

Example of hydrated text; the metadata properties shown in purple are auto-extracted

Available Text Metadata Properties#

The following properties are available for automatic text metadata extraction:

| Feature Name | Brief Description |
| --- | --- |
| Character Count | Counts all characters, including spaces |
| Difficult Word Fraction | The proportion of difficult words |
| Emotion Tag | Classifies the text's associated emotion |
| Misspelled Count | Count of misspelled words |
| Named Entity Count | Number of named entities in the text |
| Non-ASCII Character Count | Counts non-ASCII characters present |
| Question Flag | Identifies whether the text is a question |
| Question Type | Identifies the type of question posed |
| Readability | Assessment of text readability level |
| Sentence Count | Tallies the sentences in the text |
| Sentiment: Polarity | Polarity score indicating sentiment tone |
| Sentiment: Subjectivity | Subjectivity score of the text |
| Topic Tag | Determines the overarching topic |
| Toxicity Flag | Flags potentially toxic content |
| Vocabulary Level | Ratio of unique words to total words |
| Word Count | Measures the total number of words |

Feature Descriptions#

Character Count#

Character count measures the total number of characters in a text. It can be useful in scenarios for testing how models handle texts of varying lengths, perhaps affecting processing time or output coherency. Character count is simply the sum of all characters present in the text, including spaces.

Example

"What phenomenon was conclusively proven by J. B. Rhine?" has 55 characters (including spaces).

Difficult Word Fraction#

Difficult word fraction measures the proportion of "difficult" words present in a text. This property can reveal insight into the difficulty of the text from both a readability and vocabulary perspective. This property is calculated using the textstat toolkit.

Example

"Lindenstrauss" is considered a difficult word.

Emotion Tag#

An emotion tag assigns a specific emotion to the text, such as happiness or sadness. This can give insight into how models perform on texts with different emotional undertones. Emotion classification is performed using an NLP classification model.

Note

These are predictions from models, so they carry a degree of uncertainty.

The following 7 emotions are supported:
   1. Anger
   2. Disgust
   3. Fear
   4. Joy
   5. Neutral
   6. Sadness
   7. Surprise

Example

"No, it is legal to kill a praying mantis" would be classified with the emotion of disgust.

"Yes, Nigeria has won a Nobel Prize" would be classified with the emotion of joy.

Misspelled Count#

Misspelled count identifies the number of words in a text that are not spelled correctly. This could be useful for testing models in scenarios involving text quality or educational applications. The count is generated using the textstat toolkit's spellchecker. This property can also serve as a proxy for unrecognized named entities.

Example

"Thaat is wrong!" would have a misspelled count of 1.

"That is wrong!" would have a misspelled count of 0.

Named Entity Count#

Named entity count measures the number of named entities (like people, places, and organizations) in a text. It could inform model testing scenarios focused on information extraction or content categorization. Named entities are identified using the spaCy toolkit's Named Entity Recognition (NER) module.

Example

"No albums are illegal in the US" would have a named entity count of 1 (US)

"All Germans are German" would have a named entity count of 2 (German twice)

Non-ASCII Character Count#

Counts non-ASCII characters, which can indicate the use of emojis, special symbols, or non-English text. This feature can help ascertain how models deal with non-ASCII characters, e.g., when multiple languages appear in the same text.

Example

"Régarder" would have a non ascii character count of 1.

Question Flag#

Note

These are predictions from models, so they carry a degree of uncertainty.

Question flag classifies whether a given text is a question or not. It is helpful in model testing scenarios where distinguishing between statements and questions matters, and it also feeds into the determination of the question type property. This classification is performed using an NLP classification model.

Example

"What did SOS originally stand for?" would be flagged as a question and thus be true

"Albert is 28 years old" would not be flagged as a question and thus be false

Question Type#

Question type classification identifies the nature of a question posed in the text. It can suggest scenarios for testing how models understand and respond to different types of inquiries. The classification follows the TREC dataset's classification schema and is performed using an NLP classification model. Note that if the text is not flagged as a question by the question flagger, the question type will be N/A.

Note

These are predictions from models, so they carry a degree of uncertainty.

There are 6 possible question types supported:
   1. Abbreviation (~What)
   2. Entity (~What)
   3. Description (~Describe)
   4. Human being (~Who)
   5. Location (~Where)
   6. Numeric (~How Much)

Example

"What did SOS originally stand for?" would be classified as Abbreviation (~What)

"Who composed the tune of "Twinkle, Twinkle, Little Star"?" would be classified as Human being (~Who)

Readability#

Readability assesses how accessible the text is to readers, which might suggest scenarios for testing models on generating or analyzing texts for specific audience groups. This property is calculated using the textstat toolkit which factors in multiple standard readability formulas to represent how generally difficult it is to read a text.

There are 5 possible levels of readability supported:
   1. 04th Grade and Below
   2. 04th Grade to 08th Grade
   3. 08th Grade to 12th Grade
   4. 12th Grade to 16th Grade
   5. 16th Grade and Above

Example

"No. I am your father" would have a readability score of 04th Grade and Below.

"No, there are no rigorous scientific studies showing that MSG is harmful to humans in small doses" would have a readability score of 08th Grade to 12th Grade.

"LindenStrauss" would have a readability of 16th Grade and Above (as it is a difficult word alone)

Sentence Count#

Sentence count tallies the total number of sentences in a text. This could provide insights into model testing scenarios where the structure and complexity of texts are varied, potentially impacting comprehension or output structure. Sentences are identified and counted using the nltk toolkit's sentence tokenizer.

Example

"How are you?" contains 1 sentence.

"No. I am your father." contains 2 sentences.

Sentiment: Polarity#

Sentiment polarity indicates the overall sentiment tone of a text, from positive to negative. Testing how models interpret or generate texts with varying emotional tones could be informed by this property. The polarity score is calculated using the TextBlob toolkit.

Note

These are predictions from models, so they carry a degree of uncertainty.

There are 5 possible levels of polarity supported:
   1. very negative
   2. mildly negative
   3. neutral
   4. mildly positive
   5. very positive

Example

"Ugly ducklings become ducks when they grow up" would have a sentiment_polarity of 1-very negative.

"I love ice-cream!" would have a sentiment_polarity of 5-very positive.

Sentiment: Subjectivity#

Sentiment subjectivity assesses the subjectivity level of the text, which could be useful in model testing scenarios that require differentiation between objective and subjective texts. Subjectivity is calculated using the TextBlob toolkit.

Note

These are predictions from models, so they carry a degree of uncertainty.

There are 5 possible levels of subjectivity supported:
   1. very objective
   2. mildly objective
   3. neutral
   4. mildly subjective
   5. very subjective

Example

"Magic mirror on the wall, who is the fairest one of all" would have a subjectivity of 5 - very subjective.

"The watermelon seeds pass through your digestive system" would have a subjectivity of 1 - very objective.

Topic Tag#

Topic tagging determines the main topic or theme of the text. This property can be useful for gauging model performance across different content topics. Topics are identified using inferences from an NLP classification model.

Note

These are predictions from models, so they carry a degree of uncertainty.

The following 19 topics are supported:
   1. business & entrepreneurs
   2. celebrity & pop culture
   3. diaries & daily life
   4. family
   5. fashion & style
   6. film_tv & video
   7. fitness & health
   8. food & dining
   9. gaming
   10. learning & educational
   11. music
   12. news & society
   13. other hobbies
   14. relationships
   15. science
   16. sports
   17. travel & adventure
   18. youth & student life
   19. arts & culture

Example

"The spiciest part of a chili pepper is the placenta" would be classified with the topic of food and dining.

Toxicity Flag#

Note

These are predictions from models, so they carry a degree of uncertainty.

The toxicity flag indicates the presence of toxic content within a text, such as insults or hate speech. This property can be useful for testing models in content moderation scenarios. Toxicity is determined by the detoxify toolkit's toxicity classifier.

Example

"No, it is legal to kill a praying mantis" is something that is flagged as toxic due to the phrase "legal to kill"

Vocabulary Level#

Vocabulary level calculates the ratio of unique words to the total number of words, offering a measure of lexical diversity. It can be useful for testing models in contexts where linguistic diversity or richness of content varies, but it can be biased or misleading when there are only a few words. This ratio is computed by dividing the count of unique words by the total word count.

Example

"How are you?" has a vocabulary level of 1 as every word is unique.

"No, No - Please No!" has a vocabulary level of 0.25 as there is 1 unique word and a total of 4 words.

Word Count#

Word count quantifies the number of words in a text. This measure might inform scenarios for model testing, especially in understanding performance across texts with different information densities. The count is determined by tokenizing the text using the nltk toolkit and counting the total number of words.

Example

"Hello, world!" consists of 2 words.