My question is similar to an earlier question. In spaCy, you can do part-of-speech tagging and noun phrase identification separately, e.g.
import spacy

nlp = spacy.load('en')
sentence = ('For instance, consider one simple phenomena: a question is '
            'typically followed by an answer, or an explicit statement of an '
            'inability or refusal to answer.')
token = nlp(sentence)
token_tag = [(word.text, word.pos_) for word in token]
The output looks like:

[('For', 'ADP'), ('instance', 'NOUN'), (',', 'PUNCT'), ('consider', 'VERB'),
 ('one', 'NUM'), ('simple', 'ADJ'), ('phenomena', 'NOUN'), ...]
For noun phrases (chunks), you can use noun_chunks, which gives the chunks of words as follows:

[nc for nc in token.noun_chunks]
# [instance, one simple phenomena, an answer, ...]
I'm wondering if there is a way to cluster the POS tags based on noun_chunks, so that the output looks like:

[('For', 'ADP'), ('instance', 'NOUN'),  # or 'noun_chunks'
 (',', 'PUNCT'),
 ('one simple phenomena', 'noun_chunks'), ...]
I figured out how to do it. Basically, you can get the start and end positions of the noun phrase tokens as follows:

noun_phrase_position = [(s.start, s.end) for s in token.noun_chunks]
noun_phrase_text = dict([(s.start, s.text) for s in token.noun_chunks])
token_pos = [(i, t.text, t.pos_) for i, t in enumerate(token)]
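The key fact this relies on is that a spaCy Span's start and end are token indices into the document, so a chunk can be recovered by slicing the token sequence. A minimal illustration with plain lists standing in for the spaCy objects (the token texts and span below are hand-copied from the example sentence, not recomputed):

```python
# Token texts from the start of the example sentence (assumed sample data).
tokens = ['For', 'instance', ',', 'consider', 'one', 'simple', 'phenomena', ':']

# (start, end) token indices, in the same form as Span.start / Span.end.
chunk_span = (4, 7)  # should cover 'one simple phenomena'

start, end = chunk_span
chunk_text = ' '.join(tokens[start:end])
print(chunk_text)  # one simple phenomena
```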
Then I combine these in order to merge the list of token_pos based on the start and end positions:
index = 0
result = []
for start, end in noun_phrase_position:
    result += token_pos[index:start]       # tokens before the noun phrase
    result.append(token_pos[start:end])    # the noun phrase itself, as a list
    index = end
result += token_pos[index:]                # keep tokens after the last noun phrase

result_merge = []
for r in result:
    if isinstance(r, list) and len(r) > 0:
        result_merge.append((r[0][0], noun_phrase_text.get(r[0][0]), 'noun_phrase'))
    else:
        result_merge.append(r)
The output is:

[(1, 'instance', 'noun_phrase'), (2, ',', 'PUNCT'), (3, 'consider', 'VERB'),
 (4, 'one simple phenomena', 'noun_phrase'), (7, ':', 'PUNCT'), (8, 'a', 'DET'), ...
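The same merging can be packaged as a single function. Here is a sketch that operates on plain tuples rather than spaCy objects, so it can be run without loading a model; the function name and the sample data (mirroring the token_pos and noun chunk positions above) are illustrative assumptions:

```python
def merge_noun_chunks(token_pos, chunk_spans, chunk_texts):
    """Replace each run of tokens covered by a noun chunk with a single
    (start_index, chunk_text, 'noun_phrase') tuple; keep other tokens as-is."""
    merged = []
    index = 0
    for start, end in chunk_spans:
        merged.extend(token_pos[index:start])  # tokens before this chunk
        merged.append((start, chunk_texts[start], 'noun_phrase'))
        index = end
    merged.extend(token_pos[index:])           # tokens after the last chunk
    return merged

# Assumed sample data in the same shape as token_pos / noun_phrase_position above.
token_pos = [(0, 'For', 'ADP'), (1, 'instance', 'NOUN'), (2, ',', 'PUNCT'),
             (3, 'consider', 'VERB'), (4, 'one', 'NUM'), (5, 'simple', 'ADJ'),
             (6, 'phenomena', 'NOUN'), (7, ':', 'PUNCT')]
chunk_spans = [(1, 2), (4, 7)]
chunk_texts = {1: 'instance', 4: 'one simple phenomena'}

print(merge_noun_chunks(token_pos, chunk_spans, chunk_texts))
# [(0, 'For', 'ADP'), (1, 'instance', 'noun_phrase'), (2, ',', 'PUNCT'),
#  (3, 'consider', 'VERB'), (4, 'one simple phenomena', 'noun_phrase'),
#  (7, ':', 'PUNCT')]
```

A single pass like this avoids building the intermediate nested list and the second loop, while producing the same result.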