Configuration Management in Python

There are a lot of ways to maintain configuration in Python project. I’ve recently learnt to do it in way that is not something new that I’ve found, but it was new to me. I’ll discuss my progression from basic hard-coding constants in a project to this new method.

Lets first consider an example application where there are two steps, each reading an input file and writing to an output file. This application will also use an API for which we need to use an API secret key. We’ll first create the API object using the API key. In the step 1, we’ll load the data then pass it to API to get the results and then the results will be saved. In the next step, we’ll load the saved API results, process it and save the new results. Here’s the python code implementing this logic in cool_config_demo.py.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import myapi
import utils

from pathlib import Path

API_SECRET_KEY = "SECRET_IS_THERE_IS_NO_SECRET"
myapi_obj = myapi(API_SECRET_KEY)

data_dir = Path("./data/")
STEP1_IN_F = data_dir / "1_infile.csv"
STEP1_OUT_F = data_dir / "1_outfile.csv"
STEP2_OUT_F = data_dir / "2_outfile.csv"

# step 1
step1_data = utils.load_data(STEP1_IN_F)
step1_out = myapi_obj.process(step1_data)
utils.save_data(step1_out, STEP1_OUT_F)

# step 2
step2_out = utils.process(step1_out)
utils.save_data(step2_out, STEP2_OUT_F)

The above code will work fine, and in fact, if the actual code size is this then keeping the hard-coding values is actually fine. Keeping that aside, some of the reasons why hard-coding values should be avoided are:

Projects are rarely ever this small and they consequently, will have many more such hard-coded constants.
By keeping hard-coded secrets (for eg., API_SECRET_KEY) inside the main file we risk the possibility of making them public by accidentally pushing them to the version control systems (e.g., Github).
If the steps 1 and 2 were more modular (defined in different modules) and this main file was just gluing them together, how will you handle the config?
- If you define these constants in each individual module, you’ll introduce duplicate code. You’ll be more prone to errors if you miss even a single change. You’ll be less efficient. Imagine editing STEP1_OUT_F file inside step 1 file and again changing it in step 2 file as it’s used as an input in step 2.
- If you define the constants in the main file and send them as function arguments in each step, you will have eliminated the redundancy, but in doing so, you’ve made code less readable if you’ve a lot of such variables to pass. One way to solve this would be to add all these constants inside a dict and then pass that dict around. But this’ll still require us to define the dict inside the main file which will add boilerplate code.

To solve the hard-coding, redundancy and boilerplate code issues, we can save all the constants inside a separate text file (e.g., config.ini) and then load this file inside the main file and then just pass that object around. The issue with this would be to parse the text file to get all the values and in the correct data types. And this is where the python configparser module helps.

Configuration file parser

Python documentation page has a pretty comprehensive reference on the usage of configparser. I’ll just show what changes to do in our toy application.

Here’s our new config - config.ini.

1
2
3
4
5
6
7
8
9
10
[API]
SECRET = SECRET_IS_THERE_IS_NO_SECRET

[DIR]
DATA = ./data/

[FILES]
STEP1_IN_F = 1_infile.csv
STEP1_OUT_F = 1_outfile.csv
STEP2_OUT_F = 2_outfile.csv

And here’s our changed code in cool_config_demo.py.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import myapi
import utils
import configparser

from pathlib import Path

# loading and parsing the config
config = configparser.ConfigParser()
config.read('config.ini')

API_SECRET_KEY = config["API"]["SECRET"]
myapi_obj = myapi(API_SECRET_KEY)

data_dir = Path(config["DIR"]["DATA"])
STEP1_IN_F = data_dir / config["FILES"]["STEP1_IN_F"]
STEP1_OUT_F = data_dir / config["FILES"]["STEP1_OUT_F"]
STEP2_OUT_F = data_dir / config["FILES"]["STEP2_OUT_F"]

# step 1
step1_data = utils.load_data(STEP1_IN_F)
step1_out = myapi_obj.process(step1_data)
utils.save_data(step1_out, STEP1_OUT_F)

# step 2
step2_out = utils.process(step1_out)
utils.save_data(step2_out, STEP2_OUT_F)

With configparser we have the following advantages:

We have eliminated the need for hard-coding the constants inside the main code - cool_config_demo.py.
Any changes in the config (config.ini) will not impact cool_config_demo.py code.
You can reasonably manage your config inside config.ini under different headers. For e.g., FILES sections can be divided into STEP1 and STEP2 headers.

Some of the things which I found difficult or cumbersome to do using configparser:

Lack of automatic datatype inference. You have to use specific getter functions (getint, getfloat, getboolean) or create your own. For example, I don’t want to convert my string paths (data_dir, STEP1_IN_F, STEP1_OUT_F and STEP2_OUT_F) into the pathlib.Path objects inside cool_config_demo.py. I want my config loader to handle that.
The config becomes difficult to manage if it is very large and involves a lot of constants or steps. You’ll have to split it into multiple files and load them individually and do all the parsing stuff.

We can do better! We should be able to handle these issues in an elegant way. And we do that by creating a module for the config itself.

Config module

For each header from the config.ini, I’ll create a new class and import them inside the cool_config_demo.py. Here’s how the new config.py looks.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
from pathlib import Path


class APIConf:
    SECRET: str

    def __init__(self):
        self.SECRET = "SECRET_IS_THERE_IS_NO_SECRET"


class DirConf:
    data: Path

    def __init__(self):
        self.data_dir = Path("./data/")


class Step1Conf(DirConf):
    in_f: Path
    out_f: Path

    def __init__(self):
        super().__init__()

        self.in_f = self.data_dir / "1_infile.csv"
        self.out_f = self.data_dir / "1_outfile.csv"


class Step2Conf(DirConf):
    in_f: Path
    out_f: Path

    def __init__(self):
        super().__init__()

        step1_fs = Step1Conf()
        self.in_f = step1_fs.out_f
        self.out_f = self.data_dir / "2_outfile.csv"

Here’s how we use the above config inside cool_config_demo.py:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import myapi
import utils
import config


# loading and parsing the config
api_conf = config.APIConf()
API_SECRET_KEY = api_cong.SECRET
myapi_obj = myapi(API_SECRET_KEY)

# step 1
step1_f = config.Step1FileConf()
step1_data = utils.load_data(step1_f.in_f)
step1_out = myapi_obj.process(step1_data)
utils.save_data(step1_out, step1_f.out_f)

# step 2
step2_f = config.Step2FileConf()
step2_out = utils.process(step2_f.in_f)
utils.save_data(step2_out, step2_f.out_f)

Look at how clean our cool_config_demo.py is now! Here’s why I think this method is better than the above two methods-

No unnecessary code to define, load and parse the config inside the main file. If you’ve noticed, I am not using pathlib module inside the cool_config_demo.py now, which, as a consequence, has made the py file much cleaner.
We are defining the datatypes of the variables inside the config now. If you look at the config class definitions (Step1Conf and Step2Conf), all of my paths are pathlib.Path objects now.
If needed you can even validate your config inside the config.py. You can check if a directory/file exists or not.
Config can be managed and organized in a better way without any redundant code. Because of the class inheritance, data_dir is available inside each individual step class. And, since output of step 1 is the input of step 2, I’ve only defined the out file path inside the step 1 class and used that variable inside step 2 class. This way redundancy is gone and there are no chances of missing changing the second after changing the first.
There is another advantage which you’ll appreciate when your code is supposed to be run in different environments (e.g., SIT, Pre-prod or Prod). You’ll make things run on your local, but many constants will change in other testing environments like database urls, HDFS paths, etc. To handle that, we can load shell env variables instead of changing defaults in config.py. Adding the following snippet inside __init__ will achieve that.
```
  self.IGNITE_HOST = os.environ.get("IGNITE_HOST", default="localhost")
```
You can also implement __str__ method in each class to print the whole config in whatever format you want.

The only disadvantage of this method is that it sometimes feels like over-engineering for many small use cases. Although using this method in all my new project has given me many usage patterns.