regex - Python regular expression: how to remove all punctuation characters from a string but keep those between numbers?


I am working on a Chinese NLP project. I need to remove all punctuation characters except those between numbers, keeping only Chinese characters (\u4e00-\u9fff) and alphanumeric characters (0-9a-zA-Z). For example, the hyphen in 12-34 should be kept, while the equal sign after 123 should be removed.

Here is my Python script.

    import re

    s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
    res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])', '', s)
    print(res)

The expected output is

中国中国foo中国bar中123国中国12-34中国 

but the actual result is

中国中国foo中国bar中123=国中国12-34中国 

I can't figure out why there is an equal sign in the output.

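Your regex first checks "=" against [^\u4e00-\u9fff0-9a-zA-Z]+, which succeeds. It then checks the lookbehind and the lookahead, and both must succeed for the character to be removed; if either one fails, the character is kept. Here the lookbehind (?<=[^0-9]) fails because "=" is preceded by the digit 3. This means your code keeps any non-alphanumeric, non-Chinese character that has a number on either side, not only those with numbers on both sides.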
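To see this concretely, here is a small sketch that lists what the question's pattern actually matches. The names `keep`, `pattern` and `matched` are mine, chosen for illustration. Neither "=" nor "-" appears among the matches, because each has a digit on at least one side.

```python
import re

s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"

# Characters to keep: Chinese characters and ASCII alphanumerics.
keep = "\u4e00-\u9fff0-9a-zA-Z"

# Pattern from the question: a punctuation run is matched only when the
# characters on BOTH sides are non-digits.
pattern = re.compile("(?<=[^0-9])[^" + keep + "]+(?=[^0-9])")

# Collect every run the pattern would remove.
matched = [m.group() for m in pattern.finditer(s)]
print(matched)

# "=" follows the digit 3, so the lookbehind fails and it survives.
print(pattern.sub("", s))
```

Note that both lookarounds also require a character to be present, so with this pattern punctuation at the very start or end of the string would not be removed either.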

You can try the following regex:

u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'

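You can use it like this:

    import re

    s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
    # findall returns every kept character, plus punctuation runs between digits
    res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))', s)
    # join is a method of the separator string, not of the list
    print(''.join(res))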
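If you would rather stay with re.sub, as in the question, one possible alternative is to match each whole punctuation run and decide in a replacement function whether to keep it. This is only a sketch under my own assumptions: the helper name `drop_unless_between_digits` and the constant `DIGITS` are hypothetical and are not part of the answer above.

```python
import re

s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"

DIGITS = "0123456789"

# A run of anything that is neither Chinese nor ASCII alphanumeric.
punct_run = re.compile("[^\u4e00-\u9fff0-9a-zA-Z]+")


def drop_unless_between_digits(m):
    """Keep the run only if it sits directly between two ASCII digits."""
    text = m.string
    # Neighbouring characters, or "" at the edges of the string.
    before = text[m.start() - 1] if m.start() > 0 else ""
    after = text[m.end()] if m.end() < len(text) else ""
    # The emptiness checks matter: "" in DIGITS is True in Python.
    if before and after and before in DIGITS and after in DIGITS:
        return m.group()
    return ""


res = punct_run.sub(drop_unless_between_digits, s)
print(res)
```

Because the replacement function handles the string edges explicitly, punctuation at the very start or end of the string is removed as well.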
