Finding files

The main entry point of this package is the Finder class. An instance is created using the root directory containing the files, and a pre-regular expression (abbreviated pre-regex or pregex) that will be transformed into a proper regex later.

When asked to find files, the finder first creates a regular expression out of the pre-regex. It then recursively searches the root directory and its subfolders. Only the subfolders matching (part of) the regex are looked into. Subfolders can simply be indicated in the pre-regex with the standard OS separator. The finder only keeps files that match the full regex.
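
The pruning described above can be sketched with the standard library alone. This is not the package's implementation: the function find_files and its segment_patterns argument (one regex per path component) are hypothetical names chosen for illustration.

```python
import os
import re
import tempfile


def find_files(root, segment_patterns):
    """Return paths (relative to root) whose components match the patterns.

    segment_patterns holds one regex per path component: all but the last
    describe subfolders, the last one describes the filename.
    """
    found = []

    def walk(directory, depth):
        is_last = depth == len(segment_patterns) - 1
        pattern = re.compile(segment_patterns[depth])
        for entry in sorted(os.scandir(directory), key=lambda e: e.name):
            if not pattern.fullmatch(entry.name):
                continue  # prune: this entry cannot lead to a full match
            if is_last and entry.is_file():
                found.append(os.path.relpath(entry.path, root))
            elif not is_last and entry.is_dir():
                walk(entry.path, depth + 1)

    walk(root, 0)
    return found


# Build a small throwaway tree to try the function on.
with tempfile.TemporaryDirectory() as root:
    for rel in ["2018/sst_01.nc", "2018/notes.txt", "tmp/sst_01.nc"]:
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    result = find_files(root, [r"\d{4}", r"sst_\d\d\.nc"])
```

The folder tmp is never entered because its name does not match the first segment, and notes.txt is dropped because it does not match the full pattern.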

Pre-regex

The pre-regex specifies the structure of the filenames relative to the root directory. Parts that vary from file to file are indicated by matchers, enclosed in parentheses and preceded by ‘%’. Each matcher is represented by the filefinder.matcher.Matcher class.

Several properties can be specified inside the matcher’s parentheses, separated by colons:

  • a group name (optional)

  • a name that will dictate the matcher regex using a correspondence table

  • a format string (optional)

  • an option switch (optional)

  • a custom regex (optional)

  • a keyword that will discard that matcher when retrieving information from a filename (optional)

The full syntax is as follows: %([group:]name[:fmt=format string][:opt[=A:B]][:rgx=custom regex][:discard]).

Note

The matchers are uniquely identified by their index in the pre-regex (starting at 0). Some other functions (see Finder.get_matches()) can use the string [group:]name to find one or more matchers.

Warning

Matchers are first found in the pre-regex by looking at matching parentheses. The pre-regex should thus have balanced parentheses or unexpected behaviour can occur.
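
To see why balanced parentheses matter, here is a minimal sketch of how matchers could be located by counting parentheses. The function find_matchers is hypothetical and is not the package's own code; it also ignores the ‘%%’ escape for brevity.

```python
def find_matchers(pregex):
    """Return (start, end, content) for each '%(...)' found in pregex."""
    matchers = []
    i = 0
    while i < len(pregex) - 1:
        if pregex[i] == "%" and pregex[i + 1] == "(":
            depth = 0
            for j in range(i + 1, len(pregex)):
                if pregex[j] == "(":
                    depth += 1
                elif pregex[j] == ")":
                    depth -= 1
                    if depth == 0:
                        # Matching close found: store the span and content.
                        matchers.append((i, j + 1, pregex[i + 2:j]))
                        i = j
                        break
            else:
                # The scan reached the end without closing the matcher.
                raise ValueError(f"unbalanced parentheses after index {i}")
        i += 1
    return matchers


# Parentheses nested inside a custom regex are handled by the depth counter.
contents = [m[2] for m in find_matchers(r"sst_%(x)-%(idx:rgx=(\d+)).nc")]
```

With an unbalanced pre-regex such as "foo_%(bar", this sketch raises an error; a less strict scanner could instead swallow the rest of the string, which is the kind of unexpected behaviour the warning refers to.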

Name

The name of the matcher dictates the regex and format string used for that matcher (unless overridden by a custom regex), and how it is used by functions that retrieve information from the filename. The Matcher.DEFAULT_ELTS class attribute holds the correspondence between name and regex:

  Name   Description         Regex      Format
  F      Date (YYYY-MM-DD)   %Y-%m-%d   s
  x      Date (YYYYMMDD)     %Y%m%d     08d
  X      Time (HHMMSS)       %H%M%S     06d
  Y      Year (YYYY)         \d{4}      04d
  m      Month (MM)          \d\d       02d
  d      Day of month (DD)   \d\d       02d
  j      Day of year (DDD)   \d{3}      03d
  B      Month name          [a-zA-Z]*  s
  H      Hour 24 (HH)        \d\d       02d
  M      Minute (MM)         \d\d       02d
  S      Seconds (SS)        \d\d       02d
  I      Index               \d+        d
  text   Letters             \w         s
  char   Character           \S*        s

These names are mainly related to dates and times. The table mostly follows the strftime format specifications. Matchers with these names are used by library.get_date() to find the date from the filename.

A letter preceded by a percent sign ‘%’ in the regex is recursively replaced by the regex of the corresponding name in the table. This can be used in a custom regex. It still counts as a single matcher and its name is not changed, only the regex. So %x is replaced by %Y%m%d, which is in turn replaced by \d{4}\d\d\d\d. A literal percent sign in the regex is escaped with another percent sign (‘%%’).
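
The recursive replacement can be sketched as follows. TABLE reproduces only part of the table above, and expand is a hypothetical helper, not a function of the package.

```python
import re

# Partial copy of the name -> regex table, for illustration only.
TABLE = {
    "F": "%Y-%m-%d",
    "x": "%Y%m%d",
    "X": "%H%M%S",
    "Y": r"\d{4}",
    "m": r"\d\d",
    "d": r"\d\d",
    "H": r"\d\d",
    "M": r"\d\d",
    "S": r"\d\d",
}


def expand(rgx):
    """Recursively replace '%<letter>' by its regex; '%%' is a literal '%'."""

    def substitute(match):
        token = match.group(0)
        if token == "%%":
            return token  # keep the escape untouched while expanding
        # Unknown letters are left as they are in this sketch.
        return TABLE.get(token[1], token)

    previous = None
    while rgx != previous:  # repeat until nothing is left to replace
        previous = rgx
        rgx = re.sub(r"%(%|[a-zA-Z])", substitute, rgx)
    return rgx.replace("%%", "%")


date_regex = expand("%x")
escaped_percent = expand("100%%_%Y")
```

Here date_regex goes through the intermediate value %Y%m%d before reaching its final form, as described above.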

Custom format

Not all possible use cases are covered by the table above. A simple way to specify a matcher is to use a format string following the Format Mini Language Specification. It is automatically transformed into a regular expression.

Specifying a format has other benefits: it can be used to convert values into strings to generate a filename from parameter values (using Finder.get_filename()), or conversely to parse filename matches into parameter values.

It is as easy as:

scale_%(scale:fmt=.1f)

Warning

Only s, d, f, e, and E format types are supported.

Parsing will fail in some unrealistic cases described in format.Format.parse().
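
The round trip between values and filenames can be illustrated with plain Python. The helper format_to_regex below is hypothetical and far cruder than what the package does: it only handles the d type and the f type with an optional precision, and ignores width, padding and sign options.

```python
import re


def format_to_regex(fmt):
    """Rough regex for a format spec of type 'd' or 'f' (illustration only)."""
    kind = fmt[-1]
    if kind == "d":
        return r"-?\d+"
    if kind == "f":
        _, dot, precision = fmt[:-1].rpartition(".")
        digits = int(precision) if dot else 6  # 'f' defaults to 6 decimals
        return rf"-?\d+\.\d{{{digits}}}"
    raise ValueError(f"unsupported format type: {kind!r}")


fmt = ".1f"

# Value -> filename: the format string produces the varying part.
filename = "scale_" + format(2.5, fmt)

# Filename -> value: the regex derived from the format finds it again.
pattern = re.compile(r"scale_(" + format_to_regex(fmt) + r")")
value = float(pattern.fullmatch(filename).group(1))
```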

Optional property

The option property provides two different features. If the flag :opt is present, a ‘?’ is appended to the matcher regex, making it optional.

If two options are indicated as :opt=A:B, the regex is set to an OR between the two options ((A|B)). The matcher can then be fixed using a boolean, which selects option B if true and option A if false. Either option can be left blank.

See these examples:

>>> Finder('', "foo_%(bar:fmt=d:opt).txt").regex
'foo_(-?\d+)?.txt'

>>> f = Finder('', "foo%(bar:opt=:_yes).txt")
>>> f.regex
'foo(|_yes)'
>>> f.fix_matchers(bar=True)
>>> f.regex
'foo_yes'
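
What these regexes accept can be checked directly with the re module. The patterns below are written by hand for illustration, modelled on the output shown above (with the dot of the extension escaped); they are not produced by the Finder.

```python
import re

# ':opt' appends '?' to the matcher group, so the value may be absent.
optional = re.compile(r"foo_(-?\d+)?\.txt")
present = optional.fullmatch("foo_12.txt").group(1)
absent = optional.fullmatch("foo_.txt").group(1)  # the group did not take part

# ':opt=:_yes' is an alternation between an empty option A and B = '_yes'.
either = re.compile(r"foo(|_yes)\.txt")
choices = [either.fullmatch(name).group(1) for name in ("foo.txt", "foo_yes.txt")]

# Once fixed to True, only option B remains and 'foo.txt' no longer matches.
fixed_true = re.compile(r"foo_yes\.txt")
still_matches = fixed_true.fullmatch("foo.txt") is not None
```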

Custom regex

Finally, one can directly use a regular expression. This will supersede the default regex, or the one generated from the format string if specified.

It can be done like so:

idx_%(idx:rgx=\d+?)

Discard keyword

Information can be retrieved from the matches in the filename, but a matcher can be discarded so that it is not used. For example, for a file of weekly averages whose filename indicates the start and end dates of the average, we might want to recover only the starting date:

sst_%(x)-%(x:discard)
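
The effect of the keyword can be mimicked by hand. In this sketch the regex is the expansion of the pre-regex above, and the discard list is a hypothetical stand-in for the flag stored on each matcher; library.get_date() may proceed differently.

```python
import re
from datetime import datetime

# Expansion of sst_%(x)-%(x:discard), with one capture group per matcher.
pattern = re.compile(r"sst_(\d{4}\d\d\d\d)-(\d{4}\d\d\d\d)")
discard = [False, True]  # second matcher carries the ':discard' keyword

match = pattern.fullmatch("sst_20180101-20180107")

# Both dates must be present for the file to match, but only the matchers
# that are not discarded contribute to the retrieved information.
kept = [value for value, skip in zip(match.groups(), discard) if not skip]
start = datetime.strptime(kept[0], "%Y%m%d")
```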

Note

By default, when fixing a matcher to a value, discarded matchers will not be fixed. This can be deactivated with the fix_discard keyword.

Group

The group name is completely optional, but it can help differentiate two matchers with the same name. It can also be used to group matchers together, for instance when returning nested lists of filenames with Finder.get_files(), or when getting a date with library.get_date().

Regex outside matchers

By default, special characters (()[]{}?*+-|^$\\.&~# \t\n\r\v\f) outside of matchers are escaped, so they are matched literally. To use regular expressions outside of matchers, it is necessary to activate the use_regex argument when creating the Finder object. Characters outside of matchers are then no longer escaped automatically, so any special character meant literally must be escaped by hand.

Note

When using regex outside matchers, Finder.get_filename() won’t work.
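
The difference between escaped and unescaped text can be seen with Python's re.escape. This only illustrates the principle; the package may escape characters through a different mechanism.

```python
import re

literal = "sst_v1.0+"  # text found outside of any matcher

# Escaped (default behaviour): every character is matched literally.
escaped = re.escape(literal)
exact_only = (
    re.fullmatch(escaped, "sst_v1.0+") is not None
    and re.fullmatch(escaped, "sst_v1x00") is None
)

# Unescaped (as with regex outside matchers): '.' matches any character
# and '+' repeats the preceding '0', so other filenames are accepted.
loose = re.fullmatch(literal, "sst_v1x00") is not None
```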

Obtaining files

Files can be retrieved with the Finder.get_files() function, or the Finder.files attribute. Both will scan the directory for files if it has not been done yet. The ‘files’ attribute also stores the matches. See Retrieve information for details on how matches are stored.

Finder.get_files() can also return nested lists of filenames. This is intended to work with xarray.open_mfdataset(), which merges files in a specific order when supplied with a nested list of files.

To this end, one must specify group names to the nested argument of the same function. The rightmost group will correspond to the innermost level.
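
The nesting can be pictured with the following sketch. The function nest and the (filename, values) pairs are hypothetical; they only show how successive group names could translate into list levels, not how Finder.get_files() is written.

```python
def nest(files, groups):
    """Nest (filename, values) pairs by successive group names.

    values maps a group name to the value parsed from the filename.
    The last name in groups is split last and so ends up innermost.
    """
    if not groups:
        return [filename for filename, _ in files]
    first, rest = groups[0], groups[1:]
    buckets = {}
    for filename, values in files:
        buckets.setdefault(values[first], []).append((filename, values))
    return [nest(buckets[key], rest) for key in sorted(buckets)]


files = [
    ("a_1.nc", {"param": "a"}),
    ("b_1.nc", {"param": "b"}),
    ("a_2.nc", {"param": "a"}),
    ("b_2.nc", {"param": "b"}),
]

# One inner list per value of the 'param' group.
nested = nest(files, ["param"])
flat = nest(files, [])
```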

A complete example is available in the examples section.