regex - MATLAB regular expressions: capture groups with named tokens


I am trying to read a few text lines from a file in MATLAB, using the regexp function to extract named tokens. While everything works quite nicely in Octave, I cannot get the same expression to work in MATLAB.

There are different kinds of lines I want to process, like:

line1 = 'attr enabled  true';
line2 = 'attr width  1.2';
line3 = 'attr size  8byte';

The regular expression I have come up with looks like:

pattern = '^attr +(?<name>\S+) +(?:(?<number>[+-]?\d+(?:\.\d+)?)(?<unit>[a-z,A-Z]*)?|(?<bool>(?:[tT][rR][uU][eE]|[fF][aA][lL][sS][eE])))$'

When I run this (in MATLAB 2016b):

[tokens, matches] = regexp(line1, pattern, 'names', 'match'); 

The result looks like:

tokens =
  0×0 empty struct array with fields:
    name
matches =
  0×0 empty cell array

The result in Octave, however, looks like:

tokens =
  scalar structure containing the fields:
    name = enabled
    number =
    unit =
    bool = true
matches =
{
  [1,1] = attr enabled  true
}

I tested the regex on regexr.com, which suggested that Octave is working correctly.

As soon as I remove the outer (non-capturing) group from the regex pattern:

pattern = '^attr +(?<name>\S+) +(?<number>[+-]?\d+(?:\.\d+)?)(?<unit>[a-z,A-Z]*)?|(?<bool>(?:[tT][rR][uU][eE]|[fF][aA][lL][sS][eE]))$'

MATLAB outputs:

tokens =
  struct with fields:
      bool: 'true'
      name: []
    number: []
      unit: []
matches =
  { true }

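So MATLAB starts recognizing the other named tokens as fields, but the name field is still empty. Furthermore, the regex no longer has the correct alternation: without the outer group, the | splits the whole pattern in two, so 'true' matches through the second branch alone. Is this a bug concerning capture groups, or do I terribly misunderstand something?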

Some simple tests suggest that MATLAB does not support named tokens nested inside non-capturing groups. The best workaround might be to use unnamed groups?

x1 = 'apple banana cat';

% named groups work:
re1 = regexp(x1, '(?<first>a.+) (?<second>b.+) (?<third>c.+)', 'names')

% non-capturing (unnamed) groups work...
re2 = regexp(x1, '(?:a.+) (?<second>b.+) (?<third>c.+)', 'names')

% nested non-capturing groups work, but not with named groups inside them
re3 = regexp(x1, '(?:(a.+)) (?<second>b.+) (?<third>c.+)', 'names')         % ok
re4 = regexp(x1, '(?:(a.+)) (b.+) (c.+)', 'tokens')                         % ok (unnamed)
re5 = regexp(x1, '(?:(?<first>a.+)) (?<second>b.+) (?<third>c.+)', 'names') % not ok

Sadly, there is no single canonical regexp definition; there are lots of flavours. Just because something works in Octave or on regexr.com is no guarantee that it will or should work elsewhere, especially when you start getting into the more exotic regions of regex.

I think you might have to work around it, though I'd be pleased to be proved wrong!

(PS: I was testing in v2016a, so YMMV.)

Edit: I've now tested in both 2016a and 2016b. "re4" works and gives the same results in both:

>> x1 = 'apple banana cat';
>> re4 = regexp(x1, '(?:(a.+)) (b.+) (c.+)', 'tokens');
>> disp(re4{1}{1})
banana
>> disp(re4{1}{2})
cat
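One possible way to work around it, as a minimal sketch rather than a definitive fix: split the parsing into two steps so that no named token ever sits inside another group. The first regexp uses plain unnamed tokens to pull out the name and the raw value; the second classifies the value with named tokens at the top level only. The variable names (lines, parsed, attr) are made up for illustration, and the sketch drops the trailing ? after the unit group, on the assumption that an empty unit is acceptable for plain numbers.

```matlab
% Hypothetical two-step parser for 'attr <name> <value>' lines.
% Assumption: each line has exactly one name and one value, separated by spaces.
lines  = {'attr enabled  true', 'attr width  1.2', 'attr size  8byte'};
parsed = cell(size(lines));

for k = 1:numel(lines)
    % Step 1: unnamed tokens only, no nesting and no alternation.
    tok = regexp(lines{k}, '^attr +(\S+) +(\S+)$', 'tokens', 'once');
    if isempty(tok)
        continue;   % line does not follow the expected layout
    end
    attr = struct('name', tok{1}, 'number', '', 'unit', '', 'bool', '');

    % Step 2: named tokens appear only at the top level of this pattern.
    num = regexp(tok{2}, ...
        '^(?<number>[+-]?\d+(?:\.\d+)?)(?<unit>[a-zA-Z]*)$', 'names');
    if ~isempty(num)
        attr.number = num.number;
        attr.unit   = num.unit;
    elseif any(strcmpi(tok{2}, {'true', 'false'}))
        attr.bool = tok{2};   % case-insensitive boolean check
    end
    parsed{k} = attr;
end
```

Each entry of parsed is then a struct with the same four fields as in the Octave output above, with the fields that do not apply left empty. This costs a second regexp call per line, but it avoids relying on how a particular regex flavour treats named tokens inside groups.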
