Base Python Rgonomic Patterns

Getting comfortable in a new language is about more than the packages you use. Syntactic sugar in base python increases the efficiency and aesthetics of python code in ways that R users may enjoy in packages like glue and purrr. This post collects a miscellaneous grab bag of tools for wrangling, formatting (f-strings), repeating (list comprehensions), faking data, and saving objects (pickle).
rstats
python
tutorial
Author

Emily Riederer

Published

January 20, 2024

Photo credit to David Clode on Unsplash

In the past few weeks, I’ve been writing about a stack of tools and specific packages like polars that may help R users feel “at home” when working in python due to similar ergonomics. However, one common snag in switching languages is ramping up on common “recipes” for higher-level workflows (e.g. how to build a sklearn modeling pipeline) but missing a language’s fundamentals that make writing glue code feel smooth (and dare I say pleasant?). It’s a maddening feeling to get code for a complex task to finish only to have the result wrapped in an object that you can’t suss out how to save or manipulate.

This post goes back to the basics. We’ll briefly reflect on a few aspects of usability that have led to the success of many workflow packages in R. Then, I’ll demonstrate a grab bag of coding patterns in python that make it feel more elegant to connect bits of code into a coherent workflow.

We’ll look at the kind of functionality that you didn’t know to miss until it was gone, that you may not be quite sure how to search for, and that leaves you wondering whether it’s even reasonable to hope for an analog[1]. This won’t be anything groundbreaking, just some nuts and bolts. Specifically: helper functions for date and time manipulation, advanced string interpolation, list comprehensions for more functional programming, and object serialization.

What other R ergonomics do we enjoy?

R’s passionate user and developer community has invested a lot in building tools that smooth over rough edges and provide slick, concise APIs to rote tasks. Specifically, a number of packages are devoted to:

  • Utility functions: Things that make it easier to “automate the boring stuff” like fs for navigating file systems or lubridate for more semantic date wrangling
  • Formatting functions: Things that help us make things look nice for users like cli and glue to improve human readability of terminal output and string interpolation
  • Efficiency functions: Things that help us write efficient workflows like purrr which provides a concise, typesafe interface for iteration

All of these capabilities are things we could somewhat trivially write ourselves, but we don’t want to and we don’t need to. Fortunately, we don’t need to in python either.

Wrangling Things (Date Manipulation)

I don’t know a data person who loves dates. In the R world, many enjoy lubridate’s wide range of helper functions for cleaning, formatting, and computing on dates.

Python’s datetime module is similarly effective. We can easily create and manage dates in date or datetime classes which make them easy to work with.

import datetime
from datetime import date
today = date.today()
print(today)
type(today)
2024-01-20
datetime.date

Two of the most important functions are strftime() and strptime().

strftime() formats dates into strings. It accepts both a date and the desired string format. Below, we demonstrate by committing the cardinal sin of writing a date in a non-ISO 8601 format.

today_str = datetime.datetime.strftime(today, '%m/%d/%Y')
print(today_str)
type(today_str)
01/20/2024
str

strptime() does the opposite and turns a string encoding a date into an actual datetime. It does not guess the format for us; the format string is a required argument, so we tell it exactly what layout to expect.

someday_dtm = datetime.datetime.strptime('2023-01-01', '%Y-%m-%d')
print(someday_dtm)
type(someday_dtm)
2023-01-01 00:00:00
datetime.datetime
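Since the two functions are mirror images, a round trip can make the relationship concrete. The snippet below is a minimal sketch with made-up variable names (raw, parsed, iso); it parses a US-style string and writes it back out as ISO 8601. It also uses date.fromisoformat(), a standard-library shortcut for strings that are already in ISO format.

```python
from datetime import date, datetime

# Parse a US-style date string by spelling out its layout...
raw = "01/20/2024"
parsed = datetime.strptime(raw, "%m/%d/%Y")

# ...then format it back out in ISO 8601 order
iso = parsed.strftime("%Y-%m-%d")
print(iso)  # 2024-01-20

# Strings already in ISO format can skip the format string entirely
same_day = date.fromisoformat(iso)
print(same_day == parsed.date())  # True
```

Note that strptime() always returns a datetime (with a midnight time component here), so .date() is needed to compare against a plain date.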

Date math is also relatively easy with datetime. For example, you can see we calculate the date difference simply by… taking the difference! From the resulting delta object, we can access the days attribute.

n_days_diff = ( today - someday_dtm.date() )
print(n_days_diff)
type(n_days_diff)
type(n_days_diff.days)
384 days, 0:00:00
int
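Date math works in the other direction too: adding a timedelta to a date gives a new date. Here is a small sketch, using fixed dates rather than today() so the results are stable; the names start, end, and date_spine are just for illustration.

```python
from datetime import date, timedelta

start = date(2023, 1, 1)
end = date(2024, 1, 20)

# Subtracting two dates yields a timedelta
gap = end - start
print(gap.days)  # 384

# Adding a timedelta to a date yields another date
one_week_later = start + timedelta(weeks=1)
print(one_week_later)  # 2023-01-08

# Combined with a list comprehension, this builds a small daily date spine
date_spine = [start + timedelta(days=i) for i in range(7)]
print(date_spine[-1])  # 2023-01-07
```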

Formatting Things (f-strings)

R’s glue is beloved for its ability to easily combine variables and text into complex strings without a lot of ugly, nested paste() functions.

python has a number of ways of doing this, but the most readable is the newest: f-strings. Simply put an f before the string and put any variable names to be interpolated in {curly braces}.

name = "Emily"
print(f"This blog post is written by {name}")
This blog post is written by Emily

f-strings also support format specifications, which are written after a colon inside the braces. Below, we format a long float to round to 2 digits.

proportion = 0.123456789
print(f"The proportion is {proportion:.2f}")
The proportion is 0.12

Any python expression, not just a single variable, can go in curly braces. So, we can instead format that proportion as a percent.

proportion = 0.123456789
print(f"The proportion is {proportion*100:.1f}%")
The proportion is 12.3%
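The part after the colon is a small formatting language of its own, and a few more of its codes come up often enough to be worth seeing side by side. The values below (n_rows, share, run_date) are invented for illustration.

```python
from datetime import date

n_rows = 1234567
share = 0.123456789
run_date = date(2024, 1, 20)

with_commas = f"{n_rows:,}"        # thousands separator
as_percent = f"{share:.1%}"        # multiplies by 100 and appends %
iso_date = f"{run_date:%Y-%m-%d}"  # strftime codes work after the colon
zero_padded = f"{7:03d}"           # pad an integer with zeros to width 3
debug_style = f"{share=:.3f}"      # Python 3.8+: echoes the expression too

print(with_commas, as_percent, iso_date, zero_padded, debug_style)
# 1,234,567 12.3% 2024-01-20 007 share=0.123
```

The percent code saves the manual multiplication by 100, and the date example shows that objects can define their own meaning for the format spec.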

Despite the slickness of f-strings, sometimes other string interpolation approaches can be useful. For example, if all the variables I want to interpolate are in a dictionary (as often will happen, for example, with REST API responses), the string format() method is a nice alternative. It allows us to pass in the dictionary, “unpacking” the argument with the ** operator[2].

result = {
    'dog_name': 'Squeak',
    'dog_type': 'Chihuahua'
}
print("{dog_name} is a {dog_type}".format(**result))
Squeak is a Chihuahua
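The same ** unpacking works for any function call, not just format(). As a rough sketch (describe() is a hypothetical helper, not something from a library), the dictionary keys are matched to the function’s argument names. The string method format_map() is a close cousin that takes the dictionary directly.

```python
def describe(dog_name, dog_type):
    """Hypothetical helper: build a sentence from two named arguments."""
    return f"{dog_name} is a {dog_type}"

result = {"dog_name": "Squeak", "dog_type": "Chihuahua"}

# ** spreads the dictionary into keyword arguments:
# describe(dog_name="Squeak", dog_type="Chihuahua")
sentence = describe(**result)
print(sentence)  # Squeak is a Chihuahua

# format_map() accepts the dictionary itself, with no unpacking needed
sentence_again = "{dog_name} is a {dog_type}".format_map(result)
print(sentence == sentence_again)  # True
```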

Application: Generating File Names

Combining what we’ve discussed about datetime and f-strings, here’s a pattern I use frequently. If I am logging results from a run of some script, I might save the results in a file suffixed with the run timestamp. We can generate this easily.

dt_stub = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
file_name = f"output-{dt_stub}.csv"
print(file_name)
output-20240120_071517.csv
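If this pattern shows up in several scripts, it may be worth wrapping in a tiny function. The sketch below is one possible shape; stamped_path() and its arguments are my own invention. It leans on pathlib from the standard library to join the directory and file name, and it takes the timestamp as an argument so the result is easy to test.

```python
from datetime import datetime
from pathlib import Path

def stamped_path(out_dir, prefix, when, ext="csv"):
    """Hypothetical helper: build <out_dir>/<prefix>-<timestamp>.<ext>."""
    stub = when.strftime("%Y%m%d_%H%M%S")
    return Path(out_dir) / f"{prefix}-{stub}.{ext}"

# A fixed timestamp for illustration; in a script, pass datetime.now()
path = stamped_path("results", "output", datetime(2024, 1, 20, 7, 15, 17))
print(path.name)  # output-20240120_071517.csv
```

Because the year-month-day ordering sorts alphabetically in the same order as chronologically, files named this way line up by run time in a directory listing.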

Repeating Things (Iteration / Functional Programming)

Thanks in part to a modern-day fiction that for loops in R are inefficient, R users have gravitated towards concise mapping functions for iteration. These can include the *apply() family[3], purrr’s map_*() functions, or parallelized versions of either.

Python too has a nice pattern for arbitrary iteration in list comprehensions. For any iterable, we can use a list comprehension to make a list of outputs by processing each input, with an optional filter (a trailing if) and an optional conditional expression (if ... else) to supply a default value.

Here are some trivial examples:

l = [1,2,3]
[i+1 for i in l]
[2, 3, 4]
[i+1 for i in l if i % 2 == 1]
[2, 4]
[i+1 if i % 2 == 1 else i for i in l]
[2, 2, 4]
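The same syntax extends beyond lists. Swapping the brackets gives dictionary comprehensions, set comprehensions, and generator expressions. A brief sketch with a made-up list of package names:

```python
words = ["glue", "purrr", "fs", "glue"]

# Dictionary comprehension: key: value pairs inside curly braces
lengths = {w: len(w) for w in words}
print(lengths)  # {'glue': 4, 'purrr': 5, 'fs': 2}

# Set comprehension: a single expression inside curly braces, duplicates dropped
unique_lengths = {len(w) for w in words}
print(sorted(unique_lengths))  # [2, 4, 5]

# Generator expression: no brackets needed when passed straight to a function
total = sum(len(w) for w in words)
print(total)  # 15
```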

There are also closer analogs to purrr like python’s map() function. map() takes a function and an iterable object and applies the function to each element. Like with purrr, functions can be anonymous (as defined in python with lambda functions) or named. List comprehensions are popular for their concise syntax, but there are many different thoughts on the matter as expressed in this StackOverflow post.

def add_one(i): 
  return i+1

# these are the same
list(map(lambda i: i+1, l))
list(map(add_one, l))
[2, 3, 4]
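For R users who reach for purrr’s map2() or imap(), the rough base python counterparts are zip() and enumerate(). This is only a loose analogy, sketched below with toy inputs, but it covers the common cases of iterating over two lists in parallel or over values with their positions.

```python
xs = [1, 2, 3]
ys = [10, 20, 30]

# Iterate over two lists in parallel (roughly purrr::map2)
sums = [x + y for x, y in zip(xs, ys)]
print(sums)  # [11, 22, 33]

# map() also accepts several iterables, one per function argument
sums_via_map = list(map(lambda x, y: x + y, xs, ys))
print(sums == sums_via_map)  # True

# Iterate over values with their positions (roughly purrr::imap)
labelled = [f"{i}: {x}" for i, x in enumerate(xs)]
print(labelled)  # ['0: 1', '1: 2', '2: 3']
```

One difference worth remembering: enumerate() counts from 0 by default, whereas R indexes from 1.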

Application: Simulation

As a (slightly) more realistic(ish) example, let’s consider how list comprehensions might help us conduct a numerical simulation or sensitivity analysis.

Suppose we want to simulate 100 draws from a Bernoulli distribution with different success probabilities and see how close our empirically calculated rate is to the true rate.

We can define the probabilities we want to simulate in a list and use a list comprehension to run the simulations.

import numpy as np

probs = [0.1, 0.25, 0.5, 0.75, 0.9]
coin_flips = [ np.mean(np.random.binomial(1, p, 100)) for p in probs ]
coin_flips
[0.05, 0.3, 0.48, 0.77, 0.87]

Alternatively, instead of returning a list of summary values, our resulting list could include whatever we want, like a list of lists! If we wanted to keep the raw simulation results, we could. The following code returns a list of 5 lists, one per probability, each holding the raw simulation results.

coin_flips = [ list(np.random.binomial(1, p, 100)) for p in probs ]
print(f"""
  coin_flips has {len(coin_flips)} elements
  Each element is itself a {type(coin_flips[0])}
  Each element is of length {len(coin_flips[0])}
  """)

  coin_flips has 5 elements
  Each element is itself a <class 'list'>
  Each element is of length 100
  

If one wished, they could then put these into a polars dataframe and explode the list-of-lists (going from a 5-row dataset to a 500-row dataset) to conduct whatever sort of analysis we want with all the replicates.

import polars as pl

df_flips = pl.DataFrame({'prob': probs, 'flip': coin_flips})
df_flips.explode('flip').glimpse()
Rows: 500
Columns: 2
$ prob <f64> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1
$ flip <i32> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0
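One caveat about the simulations in this section: without a seed, the numbers change on every run. numpy has its own seeding tools, but as a minimal sketch that needs no extra packages, here is the same idea using only the standard library’s random module. The function simulate_rates() and the seed value are assumptions for illustration, not part of the code above.

```python
import random

def simulate_rates(probs, n_draws, seed):
    """Hypothetical helper: empirical success rate for each probability."""
    rng = random.Random(seed)  # a private generator, so global state is untouched
    return [
        # rng.random() < p is True with probability p; True counts as 1 in sum()
        sum(rng.random() < p for _ in range(n_draws)) / n_draws
        for p in probs
    ]

probs = [0.1, 0.25, 0.5, 0.75, 0.9]
rates = simulate_rates(probs, n_draws=100, seed=42)
rates_again = simulate_rates(probs, n_draws=100, seed=42)

# The same seed reproduces the same draws
print(rates == rates_again)  # True
```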

We’ll return to list comprehensions in the next section.

Faking Things (Data Generation)

Creating simple miniature datasets is often useful in analysis. When working with a new package, it’s an important part of learning, developing, debugging, and eventually unit testing. We can easily run our code on a simplified data object where the desired outcome is easy to determine to sanity-check our work, or we can use fake data to confirm our understanding of how a program will handle edge cases (like the diversity of ways different programs handle null values). Simple datasets can also be used as spines and scaffolds for more complex data wrangling tasks (e.g. joining event data onto a date spine).

In R, data.frame() and expand.grid() are go-to functions, coupled with vector generators like rep() and seq(). Python has many similar options.

Fake Datasets

For the simplest of datasets, we can manually write a few entries as with data.frame() in R. Here, we define series in a named dictionary where each dictionary key turns into a column name.

import polars as pl

pl.DataFrame({
  'a': [1,2,3],
  'b': ['x','y','z']
})
shape: (3, 2)
a b
i64 str
1 "x"
2 "y"
3 "z"

If we need longer datasets, we can use helper functions in packages like numpy to generate the series. Functions like arange() and linspace() work similarly to R’s seq().

import polars as pl
import numpy as np

pl.DataFrame({
  'a': np.arange(stop = 3),
  'b': np.linspace(start = 9, stop = 24, num = 3)
})
shape: (3, 2)
a b
i32 f64
0 9.0
1 16.5
2 24.0

If we need groups in our sample data, we can use np.repeat(), which works like R’s rep() with the each argument (e.g. rep(x, each = 2)).

pl.DataFrame({
  'a': np.repeat(np.arange(stop = 3), 2),
  'b': np.linspace(start = 3, stop = 27, num = 6)
})
shape: (6, 2)
a b
i32 f64
0 3.0
0 7.8
1 12.6
1 17.4
2 22.2
2 27.0
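For short vectors, plain python lists can do the same job without numpy. As a rough sketch, a nested comprehension repeats each element in place (like rep() with each), while multiplying a list repeats the whole sequence (like rep() with times).

```python
groups = [0, 1, 2]

# Repeat each element twice in place, like rep(x, each = 2) in R
each_twice = [g for g in groups for _ in range(2)]
print(each_twice)  # [0, 0, 1, 1, 2, 2]

# Repeat the whole list twice, like rep(x, times = 2) in R
whole_twice = groups * 2
print(whole_twice)  # [0, 1, 2, 0, 1, 2]
```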

Alternatively, for more control and succinct typing, we can create a nested dataset in polars and explode it out.

(
  pl.DataFrame({
    'a': [1, 2, 3],
    'b': ["a b c", "d e f", "g h i"]
  })
  .with_columns(pl.col('b').str.split(" "))
  .explode('b')
)
shape: (9, 2)
a b
i64 str
1 "a"
1 "b"
1 "c"
2 "d"
2 "e"
2 "f"
3 "g"
3 "h"
3 "i"

Similarly, we could use what we’ve learned about polars list columns and list comprehensions.

a = [1, 2, 3]
b = [ [q*i for q in [1, 2, 3]] for i in a]
pl.DataFrame({'a':a,'b':b}).explode('b')
shape: (9, 2)
a b
i64 i64
1 1
1 2
1 3
2 2
2 4
2 6
3 3
3 6
3 9

In fact, multidimensional list comprehensions can be used to mimic R’s expand.grid() function.

pl.DataFrame(
  [(x, y) for x in range(3) for y in range(3)],
  schema = ['x','y']
  )
shape: (9, 2)
x y
i64 i64
0 0
0 1
0 2
1 0
1 1
1 2
2 0
2 1
2 2
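When the grid has more than two dimensions, stacking for clauses gets unwieldy. The standard library’s itertools.product() generates the same combinations for any number of inputs. A small sketch with toy inputs:

```python
from itertools import product

# Every combination of the two inputs, as a list of tuples
grid = list(product(range(3), ["a", "b"]))
print(grid)
# [(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]

# The same grid as a list of records, ready to hand to a dataframe constructor
rows = [{"x": x, "y": y} for x, y in product(range(3), ["a", "b"])]
print(len(rows))  # 6
```

One ordering difference from R: product() varies the last input fastest, while expand.grid() varies the first column fastest. The set of rows is the same either way.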

Built-In Data

R has a number of canonical datasets like iris built in to the core language. These are easy to grab quickly for experimentation[4]. While base python doesn’t include such capabilities, many of the exact same or similar datasets can be found in seaborn.

seaborn.get_dataset_names() provides the list of available options. Below, we load the Palmer Penguins data and then, optionally, convert it from pandas to polars.

import seaborn as sns
import polars as pl

df_pd = sns.load_dataset('penguins')
df = pl.from_pandas(df_pd)
df.glimpse()
Rows: 344
Columns: 7
$ species           <str> 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie'
$ island            <str> 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen'
$ bill_length_mm    <f64> 39.1, 39.5, 40.3, None, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0
$ bill_depth_mm     <f64> 18.7, 17.4, 18.0, None, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2
$ flipper_length_mm <f64> 181.0, 186.0, 195.0, None, 193.0, 190.0, 181.0, 195.0, 193.0, 190.0
$ body_mass_g       <f64> 3750.0, 3800.0, 3250.0, None, 3450.0, 3650.0, 3625.0, 4675.0, 3475.0, 4250.0
$ sex               <str> 'Male', 'Female', 'Female', None, 'Female', 'Male', 'Female', 'Male', None, None

Saving Things (Object Serialization)

Sometimes, it can be useful to save objects as they existed in RAM in an active programming environment. R users may have experienced this if they’ve used .rds, .rda, or .Rdata files to save individual variables or their entire environment. These objects can often be faster to reload than plaintext and can better preserve information that may be lost in other formats (e.g. storing a dataframe in a way that preserves its datatypes versus writing to a CSV file[5], or storing a complex object that can’t be easily reduced to plaintext, like a model with training data, hyperparameters, learned tree splits or weights or whatnot for future predictions). This is called object serialization[6].

Python has comparable capabilities in the pickle module. There aren’t really style points here, so I’ve not much to add beyond “this exists” and “read the documentation”. But, at a high level, it looks something like this:

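import pickle

# to write a pickle (my_object can be any picklable python object)
with open('my-obj.pickle', 'wb') as handle:
    pickle.dump(my_object, handle, protocol = pickle.HIGHEST_PROTOCOL)

# to read a pickle (the with block closes the file for us)
with open('my-obj.pickle', 'rb') as handle:
    my_object = pickle.load(handle)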
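To see the round trip end to end, here is a self-contained sketch. The dictionary standing in for a model object is made up, and the file is written to a temporary directory so nothing is left behind.

```python
import pickle
import tempfile
from pathlib import Path

# A stand-in for some complex object; any picklable object would do
my_object = {"model": "toy", "weights": [0.1, 0.2], "params": {"depth": 3}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "my-obj.pickle"

    # Serialize to disk
    with open(path, "wb") as handle:
        pickle.dump(my_object, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Deserialize back into memory
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == my_object)  # True: same contents
print(restored is my_object)  # False: a new copy, not the same object
```

One point from the pickle documentation that is worth repeating: unpickling can execute arbitrary code, so only load pickle files from sources you trust.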

Footnotes

  1. I defined this odd scope to help limit the infinite number of workflow topics that could be included like “how to write a function” or “how to source code from another script”.

  2. This is called “**kwargs” and works a bit like do.call() in base R. You can read more about it here.

  3. Speaking of non-ergonomic things in R, the *apply() family is notoriously diverse in its number and order of arguments.

  4. Particularly if you want to set wildly unrealistic expectations for the efficacy of k-means clustering, but I digress.

  5. And yes, you can and should use Parquet and then my example falls apart, but that’s not the point!

  6. And, if you want to go incredibly deep here, check out this awesome post by Danielle Navarro.