
Natural language processing

spaCy Web App


TOKENIZER

The tokenizer segments text into tokens and creates a Doc object from the discovered segment boundaries.
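To make "segment boundaries" concrete, here is a minimal sketch in plain Python. It is a toy, not spaCy's tokenizer: the regular expression, the ToyDoc class and the segment function are all hypothetical names invented for illustration, and spaCy's real rules for prefixes, suffixes and exceptions are far richer.

```python
import re
from dataclasses import dataclass

# Toy rule (an assumption): a token is a run of word characters
# or a single punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc: the text plus token boundaries."""
    text: str
    boundaries: list  # (start, end) character offsets, end exclusive

    def tokens(self):
        # Recover each token by slicing the original text.
        return [self.text[start:end] for start, end in self.boundaries]


def segment(text):
    """Find token boundaries and wrap them in a ToyDoc."""
    return ToyDoc(text, [m.span() for m in TOKEN_RE.finditer(text)])


doc = segment("Hello, world!")
print(doc.boundaries)  # [(0, 5), (5, 6), (7, 12), (12, 13)]
print(doc.tokens())    # ['Hello', ',', 'world', '!']
```

Storing offsets rather than copied strings means the original text is never altered, so every token can be mapped back to its exact position.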

TAGGER

After tokenization, spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.
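The idea that a model learns "a word following 'the' is most likely a noun" from examples can be illustrated with a deliberately tiny count-based sketch. This is not how spaCy's statistical models work internally. The training sentences below are hand-written, hypothetical data, and train and predict_next_tag are made-up helper names.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged examples (word, tag); not real corpus data.
TRAINING = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("old", "ADJ"), ("house", "NOUN")],
]


def train(sentences):
    """Count which tags follow each word in the examples."""
    following = defaultdict(Counter)
    for sent in sentences:
        for (prev_word, _), (_, tag) in zip(sent, sent[1:]):
            following[prev_word.lower()][tag] += 1
    return following


def predict_next_tag(model, prev_word):
    """Return the tag seen most often after prev_word, or None if unseen."""
    counts = model.get(prev_word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


model = train(TRAINING)
print(predict_next_tag(model, "the"))  # NOUN
```

With only three sentences the prediction is fragile. A real model needs enough examples that its predictions generalize beyond the sentences it was shown.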

PARSER

spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or “chunks”.
You can check whether a Doc object has been parsed with the doc.is_parsed attribute, which returns a boolean value (in spaCy v3 and later, the equivalent check is doc.has_annotation("DEP")). If the Doc has not been parsed, the default sentence iterator will raise an exception.
Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence.

Text: The original token text.
Dep: The syntactic relation connecting child to head.
Head text: The original text of the token head.
Head POS: The part-of-speech tag of the token head.
Children: The immediate syntactic dependents of the token.

TEXT | DEP | HEAD TEXT | HEAD POS | CHILDREN
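The columns above can be reproduced with a small plain-Python sketch of a dependency tree. The parse is hand-annotated rather than produced by spaCy, and ToyToken, children and arcs are hypothetical names. As an assumption borrowed from spaCy's convention, the root token is its own head.

```python
from dataclasses import dataclass


@dataclass
class ToyToken:
    text: str
    pos: str
    dep: str
    head: int  # index of the head token; the root points to itself


# Hand-annotated parse of "She reads long books" (illustrative only).
SENTENCE = [
    ToyToken("She", "PRON", "nsubj", 1),
    ToyToken("reads", "VERB", "ROOT", 1),
    ToyToken("long", "ADJ", "amod", 3),
    ToyToken("books", "NOUN", "dobj", 1),
]


def children(tokens, i):
    """Immediate dependents of token i: every other token whose head is i."""
    return [t.text for j, t in enumerate(tokens) if t.head == i and j != i]


def arcs(tokens):
    """One row per token. Each word has exactly one head, so visiting
    every word visits every arc in the tree exactly once."""
    rows = []
    for i, tok in enumerate(tokens):
        head = tokens[tok.head]
        rows.append((tok.text, tok.dep, head.text, head.pos, children(tokens, i)))
    return rows


for row in arcs(SENTENCE):
    print(row)
# ('She', 'nsubj', 'reads', 'VERB', [])
# ('reads', 'ROOT', 'reads', 'VERB', ['She', 'books'])
# ('long', 'amod', 'books', 'NOUN', [])
# ('books', 'dobj', 'reads', 'VERB', ['long'])
```

On a parsed spaCy Doc, the same five fields are available on each token as token.text, token.dep_, token.head.text, token.head.pos_ and token.children.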

NER

spaCy features an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens. The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products.
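As a rough illustration of what "labels on contiguous spans of tokens" means, the sketch below stores each entity as a start index, an end index and a label. The sentence and its entities are hand-written rather than model output, ToySpan and entity_text are hypothetical names, and the labels ORG and GPE follow the scheme used by spaCy's English models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySpan:
    start: int  # index of the first token in the entity
    end: int    # index one past the last token
    label: str


TOKENS = ["Apple", "opened", "an", "office", "in", "San", "Francisco"]

# Hand-written entity annotations (illustrative only).
ENTITIES = [ToySpan(0, 1, "ORG"), ToySpan(5, 7, "GPE")]


def entity_text(tokens, span):
    """Join the contiguous tokens covered by the span."""
    return " ".join(tokens[span.start:span.end])


labeled = [(entity_text(TOKENS, s), s.label) for s in ENTITIES]
print(labeled)  # [('Apple', 'ORG'), ('San Francisco', 'GPE')]
```

Because a span is contiguous, two integers are enough to describe a multi-token entity such as "San Francisco".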