FileFinder documentation#
FileFinder lets you specify the structure of filenames using a simple syntax. Parts of the file structure that vary from file to file are indicated by named groups, defined either with format strings or regular expressions (with pre-defined values for some names). Once set up, it can:
Find corresponding files in a directory (and sub-directories)
Parse values from the filenames
Select only filenames with specific values
Generate filenames
The following example will find all files with the structure Data/[year]/Temperature_[depth]_[date].nc:
from filefinder import Finder
finder = Finder('/.../Data', '%(Y)/Temperature_%(depth:fmt=d)_%(Y)%(m)%(d).nc')
files = finder.get_files()
We can also select only some files, for instance only those in January:
finder.fix_group('m', 1)
files = finder.get_files()
We can retrieve values from found files:
filename, matches = finder.files[0]
depth = matches["depth"]
# the date as a datetime object
date = filefinder.library.get_date(matches)
And we can generate a filename with a set of parameters:
finder.make_filename(depth=100, Y=2000, m=1, d=1)
# Specifying the month is optional since we already fixed it to 1.
Installation#
FileFinder can be installed directly from pip:
pip install filefinder
or from source with:
pip install -e git+https://github.com/Descanonge/filefinder.git#egg=filefinder
Usage#
Let’s demonstrate the main features of FileFinder using a simple example. Detailed information about some steps will be provided in separate pages.
We are going to deal with a dataset with multiple files all located in the directory /data/. They are organized by sub-directories corresponding to different parameter values, then in yearly sub-directories:
/data/param_[parameter]/[year]/variable_[date].nc
/data/param_0.0/2012/variable_2012-01-01.nc
/data/param_0.0/2012/variable_2012-01-02.nc
...
/data/param_1.5/2012/variable_2012-01-01.nc
...
Create the Finder object#
To manage this, we are going to use the main entry point of this package: the Finder class. Its main arguments are the root directory containing the files, and a pattern specifying the filename structure.
The pattern is used both to generate filenames from given values and to scan the disk for files matching it.
finder = Finder(
"/data/",
"param_%(param:fmt=.1f)/%(Y)/variable_%(Y)-%(m)-%(d).nc"
)
The parts that vary from file to file are indicated in the pattern by parentheses, preceded by a percent sign. Within the parentheses are specifications for a Group, which will handle creating the regular expression to find files and formatting values appropriately.
Important
Details on the different ways to specify a varying group are given in the Pattern section.
In short: date-related groups only need a name, since filefinder has defaults for them. For the parameter, a format string suffices.
Fix groups#
Each group can be fixed to one possible value or a set of possible values. This will restrict the filenames that match the pattern when scanning files.
Note
Also, when creating filenames, if a group already has a fixed value it will be used by default.
Fixing groups can be done with either the Finder.fix_group() or Finder.fix_groups() methods.
Groups can be selected either by their index in the filename pattern (starting
from 0), or by their name. If using a group name, multiple groups can be fixed
to the same value at once.
The given value can be:
a number: will be formatted to a string according to the group specification. For scanning files, the string will be properly escaped for use in a regular expression.
a boolean: if the group has two options (specified with the bool keyword), one of the options is selected and used as a string.
a string: the value is directly interpreted as a regular expression and used as-is when scanning files or creating filenames, without further escaping or formatting.
a list of any of the above: each element will be formatted to a string if not already. When scanning files, all elements are considered by joining them with OR, as in (value1|value2|...). When creating filenames, only the first element of the list is used.
So for example:
>>> finder.fix_group("param", "[a-z]+")
# the string will be kept as is
>>> finder.fix_group("param", 3.)
# the number will be formatted as "3\.0"
More practically, we could keep only the files corresponding to January:
finder.fix_group("m", 1)
We could also select specific days using a list:
finder.fix_groups(d=[1, 3, 5, 7])
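To illustrate the rules above, here is a minimal standalone sketch of how a fixed value might be turned into a regex fragment for scanning. It does not use filefinder: the function fixed_value_to_regex and its fmt argument are hypothetical, booleans are left out, and the actual implementation may differ.

```python
import re


def fixed_value_to_regex(value, fmt):
    """Hypothetical helper: turn a fixed value into a regex fragment."""
    if isinstance(value, (list, tuple)):
        # Each element is converted, then the alternatives are joined with OR.
        return "(" + "|".join(fixed_value_to_regex(v, fmt) for v in value) + ")"
    if isinstance(value, str):
        # Strings are taken as regular expressions and kept as-is.
        return value
    # Numbers are formatted first, then escaped for use in a regex.
    return re.escape(format(value, fmt))


print(fixed_value_to_regex("[a-z]+", ".1f"))      # [a-z]+
print(fixed_value_to_regex(3.0, ".1f"))           # 3\.0
print(fixed_value_to_regex([1, 3, 5, 7], "02d"))  # (01|03|05|07)
```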
Note
Fixed values can be changed/overwritten at any time, or unfixed using the Finder.unfix_groups() method.
Warning
A group flagged as :discard will not be fixed by default, unless using the keyword argument fix_discard in fix_group() and fix_groups().
Find files#
Retrieve files#
Files can be retrieved with the Finder.get_files() method, or from the Finder.files attribute. Both will automatically scan the directory for matching files and cache the results for future accesses. The files are stored in alphabetical order.
Note
The cache is invalidated when using certain methods, such as those that fix groups. For that reason, avoid setting attributes directly and use the setter methods instead.
The method get_files() simply returns a sorted list of the filenames found when scanning. By default the full path is returned, i.e. the concatenation of the root directory and the pattern part. It can also return the filename relative to the root directory (i.e. only the pattern part).
Instead of a flat list of filenames, get_files() can also arrange them in nested lists. To that end, one must provide the nested argument with a list that specifies the order in which groups must be nested. Each element of the list gives:
a group, by index or name, so that files are grouped together based on the value of that group
multiple groups, by a tuple of indices or names, so that files are grouped based on the combination of values from those groups.
An example might help to grasp this. Again with the same pattern, we can ask to group by values of ‘param’:
>>> finder.get_files(nested=["param"])
[
    [
        "/data/param_0.0/2012/variable_2012-01-01.nc",
        "/data/param_0.0/2012/variable_2012-01-02.nc",
        ...
    ],
    [
        "/data/param_1.5/2012/variable_2012-01-01.nc",
        "/data/param_1.5/2012/variable_2012-01-02.nc",
        ...
    ],
    ...
]
We obtain as many lists as different values found for ‘param’. Because we did not specify any other group, the nesting stops there. But we could choose to also group by the year:
>>> finder.get_files(nested=["param", "Y"])
[
    [  # param = 0.0
        [  # Y = 2012
            "/data/param_0.0/2012/variable_2012-01-01.nc",
            ...
        ],
        [  # Y = 2013
            "/data/param_0.0/2013/variable_2013-01-01.nc",
            ...
        ],
        ...
    ],
    [  # param = 1.5
        ...
    ],
    ...
]
Or, to group by date as well, we can specify multiple groups for one nesting level:
>>> finder.get_files(nested=["param", ("Y", "m", "d")])
[
    [  # param = 0.0
        ["/data/param_0.0/2012/variable_2012-01-01.nc"],
        ["/data/param_0.0/2012/variable_2012-01-02.nc"],
        ...
    ],
    [  # param = 1.5
        ["/data/param_1.5/2012/variable_2012-01-01.nc"],
        ["/data/param_1.5/2012/variable_2012-01-02.nc"],
        ...
    ],
    ...
]
Note
This is intended to work with xarray.open_mfdataset, which will merge files in a specific order when supplied a nested list of files.
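The nesting logic can be pictured with a small standalone sketch. It works on plain (filename, values) pairs rather than on a real Finder, and the helper nest is hypothetical: it only shows one way of grouping by successive keys, not how the package does it internally.

```python
from itertools import groupby


def nest(files, keys):
    """Hypothetical helper: nest (filename, values) pairs by successive keys.

    Each element of `keys` is a group name or a tuple of group names.
    """
    if not keys:
        return [filename for filename, _ in files]
    first, rest = keys[0], keys[1:]
    names = first if isinstance(first, tuple) else (first,)

    def key_func(item):
        return tuple(item[1][name] for name in names)

    # groupby only merges consecutive items, so sort by the same key first.
    ordered = sorted(files, key=key_func)
    return [nest(list(group), rest) for _, group in groupby(ordered, key=key_func)]


files = [
    ("param_0.0/2012/variable_2012-01-01.nc", {"param": 0.0, "Y": 2012}),
    ("param_0.0/2013/variable_2013-01-01.nc", {"param": 0.0, "Y": 2013}),
    ("param_1.5/2012/variable_2012-01-01.nc", {"param": 1.5, "Y": 2012}),
]
by_param = nest(files, ["param"])               # two lists, one per param value
by_param_and_year = nest(files, ["param", "Y"])  # one extra level per year
```

Passing a tuple such as ("param", "Y") as a single element groups on the combination of both values at one nesting level.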
Retrieve information#
As some metadata might only be found in the filenames, FileFinder offers the possibility to retrieve it easily. The Finder caches a list of files matching the pattern, along with information about the parts that matched the groups.
The Finder.files attribute stores a list of tuples, each containing a filename and a Matches object storing that information.
Note
One can also scan any filename for matches with the Finder.find_matches() method.
In most cases, the simplest approach is to index the Matches object with a group index or name:
>>> file, matches = finder.files[0]
>>> matches["param"]
0.0 # a float, parsed from the filename
This method has several caveats:
When using a group name, the first group in the pattern with that name is taken, even if there could be more groups with different values (a warning is issued if that is the case).
Only groups not flagged as ‘:discard’ will be selected. If no group can be found, an error will be raised.
The parsing of a value from the filename can fail for a variety of reasons. If that is the case, an error will be raised.
To counter those, one can use Matches.get_values(), which will return a list of values corresponding to the selected group(s). It has arguments keep_discard and parse to choose whether to keep discarded groups and whether to use the parsed value or solely the string that matched.
Matches.get_value() will return the first element of that list, raise if the list is empty, and warn if the values are not all equal.
Note
matches[key] is a thin wrapper around matches.get_value(key, parse=True, keep_discard=False).
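The behaviour described for get_value (first element, error on an empty list, warning on differing values) can be summarised with a standalone sketch. The function below is a hypothetical stand-in operating on a plain list, and the exception type it raises is an assumption, not necessarily what the package uses.

```python
import warnings


def get_value(values):
    """Hypothetical stand-in for the behaviour described above."""
    if not values:
        # Assumption: the real package may raise a different exception type.
        raise IndexError("no group matched the given key")
    if any(v != values[0] for v in values[1:]):
        warnings.warn("groups with the same name have different values")
    return values[0]


# Two groups named "Y" that agree: no warning, the first value is returned.
print(get_value([2012, 2012]))  # 2012
```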
As date/time values are scattered among multiple groups, the package supplies the function library.get_date() to easily retrieve a datetime object from matches:
from filefinder.library import get_date
matches = finder.find_matches(filename)
date = get_date(matches)
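Conceptually, this amounts to collecting the date-related values and passing them to datetime. The sketch below shows the idea on a plain dictionary; date_from_values is a hypothetical helper, and the default values used for missing elements are assumptions rather than the behaviour of library.get_date().

```python
from datetime import datetime


def date_from_values(values):
    """Hypothetical helper: build a datetime from parsed group values.

    `values` maps group names (Y, m, d, H, M, S) to integers. Missing
    elements fall back to the defaults below (an assumption of this sketch).
    """
    defaults = {"Y": 1970, "m": 1, "d": 1, "H": 0, "M": 0, "S": 0}
    parts = {**defaults, **values}
    return datetime(parts["Y"], parts["m"], parts["d"],
                    parts["H"], parts["M"], parts["S"])


print(date_from_values({"Y": 2012, "m": 1, "d": 2}))  # 2012-01-02 00:00:00
```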
Directories in pattern#
The pattern can contain directory separators. The Finder can explore sub-directories to find the files.
Important
In the pattern, a directory separator should always be indicated with the forward slash /, even on Windows where we normally use the backslash. It will be replaced by the correct character when necessary.
We do this because the backslash has special meanings in regular expressions, and it is difficult to disambiguate the two.
The scanning process is as follows. It first generates a regular expression based on the pattern and the fixed values. This expression is meant to match paths relative to the root directory and has a capturing group for each pattern group.
The Finder then explores all sub-directories to find matching files using one of two methods.
By default, the regular expression is split at each path separator occurrence, so that we can eliminate folders that do not match the pattern. However, it cannot deal with some patterns in which a group contains a path separator.
For those more complicated patterns, by setting the attribute/parameter Finder.scan_everything to True, we will explore all sub-directories up to a depth of Finder.max_scan_depth.
The second method can be more costly for some directory structures (with many sibling folders, for instance) but can deal with more exotic patterns. A likely example is that of an optional directory:
>>> "basedir/%(subdir:bool=subdir_name/:)rest_of_pattern"
basedir/rest_of_pattern
basedir/subdir_name/rest_of_pattern
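The default method (splitting the regular expression at each path separator and pruning folders that do not match) can be sketched with the standard library alone. The scan function and the hand-written per-level regexes below are hypothetical and only illustrate the idea. They are not taken from the package.

```python
import os
import re
import tempfile


def scan(root, segment_regexes, prefix=""):
    """Hypothetical sketch: yield relative paths matching one regex per level."""
    current, remaining = segment_regexes[0], segment_regexes[1:]
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if not re.fullmatch(current, entry.name):
            continue  # prune: nothing below this entry can match
        relative = prefix + entry.name
        if remaining and entry.is_dir():
            yield from scan(entry.path, remaining, relative + "/")
        elif not remaining and entry.is_file():
            yield relative


with tempfile.TemporaryDirectory() as root:
    for name in ["param_0.0/2012/variable_2012-01-01.nc",
                 "param_0.0/notes/readme.txt",
                 "other/2012/variable_2012-01-01.nc"]:
        path = os.path.join(root, *name.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    # The full regex, split by hand at each "/" of the pattern.
    segments = [r"param_\d+\.\d", r"\d{4}", r"variable_\d{4}-\d\d-\d\d\.nc"]
    found = list(scan(root, segments))

print(found)  # ['param_0.0/2012/variable_2012-01-01.nc']
```

The "notes" and "other" folders are never entered, which is where the savings come from. A group whose regex can itself span a path separator cannot be split this way, hence the second method.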
Create filenames#
Using the information contained in the filename pattern we can also generate arbitrary filenames. This is done with Finder.make_filename(). Any group that does not already have its value fixed must have a value supplied as argument.
As for fixing, a value will be appropriately formatted but a string will be left untouched.
So for instance:
>>> finder.make_filename(param=1.5, Y=2012, m=1, d=5)
"/data/param_1.5/2012/variable_2012-01-05.nc"
We can also fix some groups:
>>> finder.fix_groups(param=2., Y=2014)
>>> finder.make_filename(m=5, d=1)
"/data/param_2.0/2014/variable_2014-05-01.nc"
>>> finder.make_filename(m=6, d=1)
"/data/param_2.0/2014/variable_2014-06-01.nc"
and also supply a string to forgo formatting:
>>> finder.make_filename(param="this-feels-wrong", m=6, d=1)
"/data/param_this-feels-wrong/2014/variable_2014-06-01.nc"
Pattern#
The pattern specifies the structure of the filenames relative to the root directory. Parts that vary from file to file are indicated by groups, enclosed by parentheses and preceded by ‘%’. They are represented by the Group class.
Each group definition starts with a name, and is then followed by multiple optional properties, separated by colons (in no particular order):
Property | Format | Description |
---|---|---|
fmt | :fmt=<format string> | Use a python format string to match this group in filenames. |
bool | :bool=<true>[:<false>] | Choose between two alternatives. The second option (false) can be omitted. |
rgx | :rgx=<custom regex> | Specify a custom regular expression directly. |
opt | :opt | Mark the group as optional. |
discard | :discard | Discard the value parsed from this group when retrieving information. |
So for instance, we can specify a filename pattern that will match an integer padded with zeros, followed by two possible options:
>>> "parameter_%(param:fmt=04d)_type_%(type:bool=foo:bar).txt"
parameter_0012_type_foo.txt
parameter_2020_type_bar.txt
Note
Groups are uniquely identified by their index in the pattern (starting at 0) and can share the same name. When using a name rather than an index, some functions may return more than one result if there are multiple groups with that name.
Warning
Groups are first found in the pattern by looking at matching parentheses. The pattern should thus have balanced parentheses or unexpected behavior can occur.
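To see why balanced parentheses matter, here is a minimal standalone sketch of locating groups by counting parentheses. The function find_groups is hypothetical and does not reproduce the package's own parser.

```python
def find_groups(pattern):
    """Hypothetical sketch: return (start, end, definition) for each %(...) group."""
    groups = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("%(", i):
            depth, j = 1, i + 2
            # Walk forward until the opening parenthesis is closed.
            while j < len(pattern) and depth > 0:
                if pattern[j] == "(":
                    depth += 1
                elif pattern[j] == ")":
                    depth -= 1
                j += 1
            if depth != 0:
                raise ValueError("unbalanced parentheses in pattern")
            groups.append((i, j, pattern[i + 2:j - 1]))
            i = j
        else:
            i += 1
    return groups


pattern = r"idx_%(idx:rgx=(\d+)?)_%(type:bool=foo:bar).txt"
for start, end, definition in find_groups(pattern):
    print(definition)
# idx:rgx=(\d+)?
# type:bool=foo:bar
```

With a stray parenthesis inside a group definition, the counter would close the group at the wrong place or never close it at all.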
Name#
The name of the group will dictate the regex and format string used for that group (unless overridden by the ‘fmt’ and ‘rgx’ properties). The Group.DEFAULT_GROUPS class attribute will make the correspondence between name and regex:
Name | Description | Regex | Format |
---|---|---|---|
F | Date (YYYY-MM-DD) | %Y-%m-%d | s |
x | Date (YYYYMMDD) | %Y%m%d | 08d |
X | Time (HHMMSS) | %H%M%S | 06d |
Y | Year (YYYY) | \d{4} | 04d |
m | Month (MM) | \d\d | 02d |
d | Day of month (DD) | \d\d | 02d |
j | Day of year (DDD) | \d{3} | 03d |
B | Month name | [a-zA-Z]* | s |
H | Hour 24 (HH) | \d\d | 02d |
M | Minute (MM) | \d\d | 02d |
S | Seconds (SS) | \d\d | 02d |
I | Index | \d+ | d |
text | Letters | \w | s |
char | Character | \S* | s |
Most of them are related to dates and follow the conventions of strftime, as described in the Python documentation section “strftime() and strptime() Behavior”.
A letter preceded by a percent sign ‘%’ in the regex will be recursively replaced by the corresponding name in the table. This can be used in the custom regex. This still counts as a single group and its name will not be changed, only the regex.
So %x will be replaced by %Y%m%d, in turn replaced by \d{4}\d\d\d\d.
A percentage character in the regex is escaped by another percentage (‘%%’).
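The recursive replacement can be sketched in a few lines. The DEFAULTS dictionary below copies only a handful of rows from the table, and the expand function is a hypothetical illustration, not the package's code.

```python
import re

# Subset of the table above: name -> regex (which may itself contain %-names).
DEFAULTS = {
    "F": "%Y-%m-%d",
    "x": "%Y%m%d",
    "Y": r"\d{4}",
    "m": r"\d\d",
    "d": r"\d\d",
}


def expand(regex):
    """Hypothetical sketch: replace %-letters recursively, and %% by %."""
    def replace(match):
        letter = match.group(1)
        if letter == "%":
            return "%"
        return expand(DEFAULTS[letter])

    return re.sub(r"%([a-zA-Z%])", replace, regex)


print(expand("%x"))        # \d{4}\d\d\d\d
print(expand("%F"))        # \d{4}-\d\d-\d\d
print(expand("100%%_%Y"))  # 100%_\d{4}
```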
Format string#
Not all possible use cases are covered in the table above. A simple way to specify a group is to use a format string following Python's Format Specification Mini-Language. It will automatically be transformed into a regular expression.
Having a format specified has other benefits: it can be used to convert values into strings to generate a filename from parameter values (using Finder.make_filename), or vice-versa to parse filename matches into parameter values.
It is as easy as scale_%(scale:fmt=.1f), which will find files such as scale_15.0 or scale_-5.6. Because we know how to transform a value into a string we can fix the group directly with a value:
finder.fix_group('scale', 15.)
or we can generate a filename:
>>> finder.make_filename(scale=2.5)
'scale_2.5'
In the opposite direction, we can retrieve a value from a filename:
>>> matches = finder.find_matches('scale_2.5')
>>> matches['scale']
2.5 # a float
If the format is never specified, it defaults to the s format.
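As a rough picture of what "transformed into a regular expression" means, here is a deliberately tiny sketch covering only three shapes of format string: zero-padded non-negative integers such as 04d, plain d, and fixed-point floats such as .1f. The functions format_to_regex and parse are hypothetical, and the package supports much more of the mini-language than this.

```python
import re


def format_to_regex(fmt):
    """Hypothetical sketch: regex for a few simple format strings."""
    m = re.fullmatch(r"0(\d+)d", fmt)
    if m:  # zero-padded, non-negative integer such as "04d"
        return rf"\d{{{m.group(1)},}}"
    if fmt == "d":
        return r"-?\d+"
    m = re.fullmatch(r"\.(\d+)f", fmt)
    if m:  # fixed-point float such as ".1f"
        return rf"-?\d+\.\d{{{m.group(1)}}}"
    raise ValueError(f"unsupported format in this sketch: {fmt!r}")


def parse(text, fmt):
    """Convert a matched string back to a number."""
    return float(text) if fmt.endswith("f") else int(text)


regex = format_to_regex(".1f")
match = re.fullmatch("scale_(" + regex + ")", "scale_-5.6")
print(parse(match.group(1), ".1f"))  # -5.6
```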
Warning
Only s, d, f, e, and E format types are supported.
Parsing of numbers will fail in some ambiguous (and quite unrealistic) cases that involve alignment padding with numbers or minus signs. Creating a format object where we cannot unambiguously remove the padding character is not allowed and will raise a DangerousFormatError.
Similarly, for a string format (s) it can be impossible to correctly separate the alignment padding character (the “fill”) from the actual value. Here the user is responsible for making sure the format fill character is suited to the values expected when parsing.
Boolean format#
The boolean format makes it easy to select between two strings. It is specified as :bool=<true>[:<false>]. The second option (false) can be omitted.
Here are a couple of examples. my_file%(special:bool=_special).txt would match both my_file.txt and my_file_special.txt. We could select only ‘special’ files using finder.fix_groups(special=True).
We can also specify both options with my_file_%(kind:bool=good:bad).txt, and select either like so:
>>> finder.make_filename(kind=True)
my_file_good.txt
>>> finder.make_filename(kind=False)
my_file_bad.txt
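A boolean group boils down to an alternation between two literal strings. The helpers below (bool_regex, bool_format, bool_parse) are hypothetical names used for a standalone sketch of that idea. They are not part of filefinder.

```python
import re


def bool_regex(true, false=""):
    """Hypothetical sketch: regex matching either option of a bool group."""
    return "(" + re.escape(true) + "|" + re.escape(false) + ")"


def bool_format(value, true, false=""):
    """Pick the string to write in a filename for a boolean value."""
    return true if value else false


def bool_parse(text, true):
    """Recover the boolean from the string that matched."""
    return text == true


regex = "my_file" + bool_regex("_special") + r"\.txt"
for name in ["my_file.txt", "my_file_special.txt"]:
    matched = re.fullmatch(regex, name).group(1)
    print(name, bool_parse(matched, "_special"))
# my_file.txt False
# my_file_special.txt True
```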
Optional flag#
The optional flag :opt marks the group as an optional part of the pattern. In effect, it appends a ? to the group regular expression. It does not affect the group in other ways.
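The effect of the appended ? can be checked with the re module directly. The group regex and filenames below are made up for illustration.

```python
import re

group_regex = r"(_v\d+)"            # regex of a hypothetical group
optional_regex = group_regex + "?"  # what appending "?" produces

pattern = "data" + optional_regex + r"\.nc"
print(bool(re.fullmatch(pattern, "data.nc")))     # True
print(bool(re.fullmatch(pattern, "data_v2.nc")))  # True
print(re.fullmatch(pattern, "data.nc").group(1))  # None
```

When the optional part is absent, the capturing group does not participate in the match, so there is no string to parse for it.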
Custom regex#
Finally, one can directly use a regular expression. This will supersede the default regex, or the one generated from the format string if specified.
It can be done like so:
idx_%(idx:rgx=\d+?)
Discard keyword#
Information can be retrieved from the matches in the filename, but one might discard a group so that it is not used. For example, for a file of weekly averages with a filename indicating the start and end dates of the average, we might want to only recover the starting date:
sst_%(x)-%(x:discard)
Note
By default, when fixing a group to a value, discarded
groups will not be fixed. This can be overridden with the fix_discard
keyword argument.
Regex outside groups#
By default, special characters (()[]{}?*+-|^$\\.&~# \t\n\r\v\f) outside of groups are escaped, and thus not interpreted as a regular expression. To use regular expressions outside of groups, it is necessary to pass use_regex=True when creating the Finder object.
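The default escaping can be pictured as follows: literal text between groups goes through re.escape, while the groups themselves are left for later processing. This is a standalone sketch with a hypothetical function name, and for brevity it only recognises groups without nested parentheses.

```python
import re

GROUP = re.compile(r"%\([^()]*\)")  # simplification: no nested parentheses


def escape_outside_groups(pattern):
    """Hypothetical sketch: escape literal text, leave %(...) groups untouched."""
    pieces = []
    last = 0
    for match in GROUP.finditer(pattern):
        pieces.append(re.escape(pattern[last:match.start()]))
        pieces.append(match.group(0))
        last = match.end()
    pieces.append(re.escape(pattern[last:]))
    return "".join(pieces)


print(escape_outside_groups("run+1_%(Y).nc"))  # run\+1_%(Y)\.nc
```

Without this step, the + and . in the example would be interpreted as regex operators, so the pattern would wrongly match names such as "runn1_2012xnc" instead of the literal "run+1_2012.nc".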
Note
When using regex outside groups, Finder.make_filename won’t work.
API References#
Content
Find files using a filename pattern.
Submodules
Main class.
Generate regex from string format, and parse strings.
Group management.
Functions to retrieve values from filename.
Matches management.
Source code: Descanonge/filefinder