Artificial Intelligence

Processing pipelines in spaCy

Manoj Kumar Patra

Date posted: August 29, 2019

Let’s look at the pipeline functions provided by spaCy . This will continue from the previous article Large scale data analysis with spaCy .

What happens when we call `nlp`?

doc = nlp("Hello world!")

First, the text is transformed into a Doc object by the tokenizer . Then a series of pipeline components is applied to the Doc in order to process it and set attributes. Finally, the processed Doc is returned.

Built-in pipeline components

A pipeline component is a function or callable that takes a doc , modifies and returns it, so that it can be processed by the next component in the pipeline.

Name	Description	Creates
tagger	Part-of-speech tagger	Token.tag
parser	Dependency parser	Token.dep, Token.head,Doc.sentsDoc.Doc.nounchunksnounchunks
ner	Named entity recognizer	Doc.ents, Token.entiob, Token.enttype.
textcat	Text classifier	Doc.cats

NOTE: The text classifier is not included in any of the pre-trained models by default as categories are very specific.

Under the hood

Pipelines are defined in model’s meta.json in order
Built-in components need binary data to make predictions.
The data is included in the model package and loaded into the component when we load the model.

Pipeline attributes

nlp.pipe_names : list of pipeline component names
nlp.pipeline : list of (name, component) tuples

import spacy

nlp = spacy.load("en_core_web_md")


print(nlp.pipe_names)

print(nlp.pipeline)

OUTPUT

['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fd284a8bda0>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fd23a74aca8>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fd23a74ad08>)]

Custom pipeline components_58

We can add custom functions to the pipeline which then gets executed when we call the nlp object on a text.

Custom functions can be used to add our own metadata to documents and tokens.

They can also be used to update built-in attributes such as doc.ents

To add a component to the pipeline, use nlp.add_pipe method. This function takes the component to add as the first positional argument. It has also the following keyword arguments that can be used optionally to determine the position of the component in the pipeline:

Argument	Description	Examples
last	If True ,add last	nlp.add_pipe(component, last=True)
First	If True ,add first	nlp.add_pipe(component, First=True)
Before	Add before component	nlp.add_pipe(component, Before=True)
After	Add After component	nlp.add_pipe(component, After=True)

Example:

import spacy

# Create the nlp object
nlp = spacy.load("en_core_web_sm")

# Define a custom component
def custom_component(doc):
  # Print the doc's length
  print('Doc length: ', len(doc))

  # Return the doc object
  return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Print the pipeline component names
print("Pipeline: ", nlp.pipe_names)

# Process the text
doc = nlp("Hello world!")

OUTPUT

Pipeline:  ['custom_component', 'tagger', 'parser', 'ner']
Doc length:  3

Extension attributes

Let’s look at how to add custom attributes (meta data) to the Doc , Token and Span objects.

Custom attributes are available via ._ property. This is done to distinguish them from the built-in attributes.

Attributes need to be registered with the Doc,Token or Span classes using the set_extension method.

import spacy

# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)

doc._.title = 'My document'
token._.is_color = True
span._.has_color = False

Types of extension attributes:

Attribute extensions – Set a default value that can be overwritten

Example:

# Set extension on the Token with default value
Token.set_extension('is_color', default=False)

doc = nlp("The sky is blue.")

# Overwrite extension attribute value
doc[3]._.is_color = True

Property extensions – Work like properties in Python. Define a getter and an optional setter function.

Getter is only called when you retrieve the attribute value.

from spacy.tokens import Token

# Define getter function
def get_is_color(token):
 colors = ["red", "yellow", "blue"]
 return token.text in colors

# Set extension on the Token with getter
Token.set_extension("is_color", getter=get_is_color)

doc = nlp("The sky is blue.")

print(doc[3].text, '-', doc[3]._.is_color) # blue - True

Span extensions should almost always use a Getter

from spacy.tokens import Token

# Define getter function
def get_has_color(span):
 colors = ["red", "yellow", "blue"]
 return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension('has_color', getter=get_has_color)

doc = nlp("The sky is blue.")

print(doc[1:4]._.has_color, '-', doc[1:4].text) # True - sky is blue
print(doc[0:2]._.has_color, '-', doc[0:2].text) # False - The sky

Method extensions – Makes the extension attribute a callable method.

a. Assign a that becomes available as an object method

b. Lets us pass arguments to the extension function

from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
  in_doc = token_text in [token.text for token in doc]
  return in_doc

# Set extension on the Doc with method
Doc.set_extension('has_token', method=has_token)

doc = nlp("The sky is blue.")
print(doc._.has_token("blue"), '- blue') # True - blue
print(doc._.has_token("cloud"), '- cloud') # False - cloud

Scaling and performance

Processing lot of texts

Let’s look at tips and tricks to make spaCy pipelines run as fast as possible and process large volumes of text efficiently.

nlp.pipe:

a. Processes the texts as stream and yields Doc objects
b. Much faster than calling nlp on each text as it batches up the texts.

docs = list(nlp.pipe(LOTS_OF_TEXTS))

Passing in context

Option 1

Setting as_tuples=True on nlp.pipe lets us pass (text, context) tuples
Yields (doc, context) tuples
Useful for associating metadata with the doc

import spacy

nlp = spacy.load("en_core_web_md")

data = [
  ('This is a text', {'id': 1, 'page_number': 15}),
  ('Add another text', {'id': 2, 'page_number': 16})
]

for doc, context in nlp.pipe(data, as_tuples=True):
  print(doc.text, context['page_number'])

OUTPUT

This is a text 15
Add another text 16

Option 2

import spacy
from spacy.tokens import Doc

Doc.set_extension("id", default=None)
Doc.set_extension("page_number", default=None)

nlp = spacy.load("en_core_web_md")

data = [
  ('This is a text', {'id': 1, 'page_number': 15}),
  ('Add another text', {'id': 2, 'page_number': 16})
]

for doc, context in nlp.pipe(data, as_tuples=True):
  doc._.id = context['id']
  doc._.page_number = context['page_number']

Using only the tokenizer

# Converts text to doc before the pipeline components are called
doc = nlp.make_doc("Hello world!")

Disabling pipeline components temporarily

Pipeline components can be disabled using the nlp.disable_pipes context manager.

Example:

To only use the ner to process document, we can temporarily disable tagger and parser as follows:

# Disable tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
  # Process the text and print the entities
  doc = nlp(text)
  print(doc.ents)

The disabled components are automatically restored after the with block.

Browse all categories

Processing pipelines in spaCy

What happens when we call nlp?

Built-in pipeline components

Under the hood

Pipeline attributes

Custom pipeline components_58

Extension attributes

Types of extension attributes:

Scaling and performance

Processing lot of texts

Passing in context

Option 1

Option 2

Using only the tokenizer

Disabling pipeline components temporarily

What happens when we call `nlp`?