regex - Python regular expression: how to remove all punctuation characters from a string but keep those between numbers?
I am working on a Chinese NLP project. I need to remove all punctuation characters except those between numbers, keeping only Chinese characters (\u4e00-\u9fff) and alphanumeric characters (0-9a-zA-Z). For example, the hyphen in 12-34 should be kept, while the equals sign after 123 should be removed.
Here is my Python script.
    import re

    s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
    res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])', '', s)
    print(res)
The expected output is
中国中国foo中国bar中123国中国12-34中国
but the result is
中国中国foo中国bar中123=国中国12-34中国
I can't figure out why there is an equals sign in the output.
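Your regex first checks "=" against [^\u4e00-\u9fff0-9a-zA-Z]+, which succeeds. It then checks the lookbehind and the lookahead, and both must succeed for the character to be matched and removed. Here the lookbehind (?<=[^0-9]) fails because "=" is preceded by "3", so there is no match and the character is kept. In other words, your code keeps any non-alphanumeric, non-Chinese character that has a digit directly on either side, not only those that sit between two digits.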
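The behaviour described above can be checked on a few short inputs. This is only an illustrative sketch: the helper name `remove_with_original` and the sample strings are made up for the demonstration, while the pattern is the one from the question.

```python
import re

# Pattern from the question: punctuation is removed only when the
# characters before AND after it are both non-digits.
PUNCT = r'[^\u4e00-\u9fff0-9a-zA-Z]'
original = re.compile(r'(?<=[^0-9])' + PUNCT + r'+(?=[^0-9])')


def remove_with_original(text):
    """Apply the question's substitution (hypothetical helper name)."""
    return original.sub('', text)


# "a=b": no digit on either side, so "=" is removed.
# "3=b" and "a=3": a digit on one side is enough to block the match.
# "1-2": digits on both sides, the hyphen is kept as intended.
for sample in ["a=b", "3=b", "a=3", "1-2"]:
    print(sample, '->', remove_with_original(sample))
```

The second and third samples are the cases the question did not expect: a single neighbouring digit is enough for the punctuation to survive.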
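You can try the following regex, which matches either a single character you want to keep or a run of punctuation with a digit on both sides:
    u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'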
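You can use it like this:
    import re

    s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
    # findall returns every kept piece; everything else is skipped
    res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))', s)
    # join is a method of the separator string, not of the list
    print(''.join(res))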
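If you would rather stay with a substitution than collect matches, the same rule can probably be expressed with re.sub and a replacement function. The sketch below is an alternative, not the approach given above, and the name `strip_punct` is made up for illustration. It assumes the rule "keep a run of punctuation only when it has a digit directly before and after it".

```python
import re

PUNCT = r'[^\u4e00-\u9fff0-9a-zA-Z]'

# First alternative: a punctuation run enclosed by digits, captured in group 1.
# Second alternative: any other punctuation run, with no group captured.
pattern = re.compile(r'(?<=[0-9])(' + PUNCT + r'+)(?=[0-9])|' + PUNCT + r'+')


def strip_punct(text):
    """Remove punctuation except runs sitting between two digits (hypothetical helper)."""
    # group(1) is None for the second alternative, so those runs become ''.
    return pattern.sub(lambda m: m.group(1) or '', text)


s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
print(strip_punct(s))
```

For the sample string from the question this prints the expected output, 中国中国foo中国bar中123国中国12-34中国.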