Pythonic analogs to R packages like glue and purrr. This post collects a miscellaneous grab bag of tools for wrangling, formatting (f-strings), repeating (list comprehensions), faking data, and saving objects (pickle).
Emily Riederer
January 20, 2024
In the past few weeks, I’ve been writing about a stack of tools and specific packages like polars that may help R users feel “at home” when working in python due to similar ergonomics. However, one common snag in switching languages is ramping up on common “recipes” for higher-level workflows (e.g. how to build a sklearn modeling pipeline) but missing a language’s fundamentals that make writing glue code feel smooth (and dare I say pleasant?). It’s a maddening feeling to get code for a complex task to finish only to have the result wrapped in an object that you can’t suss out how to save or manipulate.
This post goes back to the basics. We’ll briefly reflect on a few aspects of usability that have led to the success of many workflow packages in R. Then, I’ll demonstrate a grab bag of coding patterns in python that make it feel more elegant to connect bits of code into a coherent workflow.
We’ll look at the kind of functionality you didn’t know to miss until it was gone: the kind you may not be quite sure how to search for, and that leaves you wondering whether it’s even reasonable to hope there’s an analog1. This won’t be anything groundbreaking – just some nuts and bolts. Specifically: helper functions for date and time manipulation, advanced string interpolation, list comprehensions for more functional programming, and object serialization.
R’s passionate user and developer community has invested a lot in building tools that smooth over rough edges and provide slick, concise APIs for rote tasks. Specifically, a number of packages are devoted to:

- fs for navigating file systems and lubridate for more semantic date wrangling
- cli and glue to improve the human readability of terminal output and string interpolation
- purrr, which provides a concise, typesafe interface for iteration

All of these capabilities are things we could somewhat trivially write ourselves, but we don’t want to and we don’t need to. Fortunately, we don’t need to in python either.
I don’t know a data person who loves dates. In the R world, many enjoy lubridate’s wide range of helper functions for cleaning, formatting, and computing on dates.
Python’s datetime module is similarly effective. We can easily create and manage dates in the date or datetime classes, which make them easy to work with.
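import datetime
from datetime import date

today = date.today()
print(today)
type(today)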
2024-01-20
datetime.date
Two of the most important functions are strftime() and strptime().
strftime() formats dates into strings. It accepts both a date and the desired string format. Below, we demonstrate by committing the cardinal sin of writing a date in a non-ISO8601 format.
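# format today (from above) in MM/DD/YYYY order
someday_str = today.strftime('%m/%d/%Y')
print(someday_str)
type(someday_str)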
01/20/2024
str
strptime() does the opposite and turns a string encoding a date into an actual date. It can’t guess the format on its own, so we have to be nice and tell it exactly what to expect.
someday_dtm = datetime.datetime.strptime('2023-01-01', '%Y-%m-%d')
print(someday_dtm)
type(someday_dtm)
2023-01-01 00:00:00
datetime.datetime
Date math is also relatively easy with datetime. For example, we can calculate the date difference simply by… taking the difference! From the resulting timedelta object, we can access the days attribute.
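A minimal sketch, reusing today and someday_dtm from above:
someday = someday_dtm.date()   # align types so we subtract a date from a date
delta = today - someday        # the difference is a datetime.timedelta
print(delta.days)              # e.g. 384 when run on 2024-01-20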
R’s glue is beloved for its ability to easily combine variables and text into complex strings without a lot of ugly, nested paste() functions.
python has a number of ways of doing this, but the most readable is the newest: f-strings. Simply put an f before the string and put any variable names to be interpolated in {curly braces}.
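For example (the variable here is illustrative):
dog_name = 'Squeak'
print(f"My dog is named {dog_name}")   # My dog is named Squeak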
f-strings also support formatting with formats specified after a colon. Below, we format a long float to round to 2 digits.
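For example (the value is illustrative):
prop = 0.123456789
print(f"{prop:.2f}")   # 0.12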
Any python expression – not just a single variable – can go in curly braces. So, we can instead format that proportion as a percent.
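One way, reusing prop from above:
print(f"{prop * 100:.2f}%")   # 12.35%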
Despite the slickness of f-strings, sometimes other string interpolation approaches can be useful. For example, if all the variables I want to interpolate are in a dictionary (as will often happen, for example, with REST API responses), the string format() method is a nice alternative. It allows us to pass in the dictionary, “unpacking” the argument with **2.
result = {
'dog_name': 'Squeak',
'dog_type': 'Chihuahua'
}
print("{dog_name} is a {dog_type}".format(**result))
Squeak is a Chihuahua
Combining what we’ve discussed about datetime and f-strings, here’s a pattern I use frequently. If I am logging results from a run of some script, I might save the results in a file suffixed with the run timestamp. We can generate this easily.
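A minimal sketch (the file stem and timestamp format are illustrative):
run_ts = datetime.datetime.now().strftime('%Y-%m-%d-%H%M%S')
filename = f"results-{run_ts}.csv"
print(filename)   # e.g. results-2024-01-20-103000.csv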
Thanks in part to a modern-day fiction that for loops in R are inefficient, R users have gravitated towards concise mapping functions for iteration. These can include the *apply() family3, purrr’s map_*() functions, or the parallelized versions of either.
Python too has a nice pattern for arbitrary iteration in list comprehensions. For any iterable, we can use a list comprehension to make a list of outputs by processing a list of inputs, with optional conditional and default expressions.
Here are some trivial examples:
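nums = [1, 2, 3]                    # illustrative inputs
[x * 2 for x in nums]               # transform every element: [2, 4, 6]
[x for x in nums if x % 2 == 1]     # keep elements passing a condition: [1, 3]
[x if x > 1 else 0 for x in nums]   # conditional ("default") expression: [0, 2, 3]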
There are also closer analogs to purrr like python’s map() function. map() takes a function and an iterable object and applies the function to each element. Like with purrr, functions can be anonymous (as defined in python with lambda functions) or named. List comprehensions are popular for their concise syntax, but there are many different thoughts on the matter, as expressed in this StackOverflow post.
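For example, incrementing each element with an anonymous function:
list(map(lambda x: x + 1, [1, 2, 3]))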
[2, 3, 4]
As a (slightly) more realistic(ish) example, let’s consider how list comprehensions might help us conduct a numerical simulation or sensitivity analysis.
Suppose we want to simulate 100 draws from a Bernoulli distribution with different success probabilities and see how close our empirically calculated rate is to the true rate.
We can define the probabilities we want to simulate in a list and use a list comprehension to run the simulations.
import numpy as np
probs = [0.1, 0.25, 0.5, 0.75, 0.9]
coin_flips = [ np.mean(np.random.binomial(1, p, 100)) for p in probs ]
coin_flips
[0.05, 0.3, 0.48, 0.77, 0.87]
Alternatively, instead of returning a list of the same length, our resulting list could include whatever we want – like a list of lists! If we wanted to keep the raw simulation results, we could. The following code returns a list of 5 lists, one per probability, each holding the raw results of its 100 simulated flips.
coin_flips = [ list(np.random.binomial(1, p, 100)) for p in probs ]
print(f"""
coin_flips has {len(coin_flips)} elements
Each element is itself a {type(coin_flips[0])}
Each element is of length {len(coin_flips[0])}
""")
coin_flips has 5 elements
Each element is itself a <class 'list'>
Each element is of length 100
If one wished, they could then put these into a polars dataframe and explode those lists of lists (going from a 5-row dataset to a 500-row dataset) to conduct whatever sort of analysis we want with all the replicates.
import polars as pl
df_flips = pl.DataFrame({'prob': probs, 'flip': coin_flips})
df_flips.explode('flip').glimpse()
Rows: 500
Columns: 2
$ prob <f64> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1
$ flip <i32> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0
We’ll return to list comprehensions in the next section.
Creating simple miniature datasets is often useful in analysis. When working with a new package, it’s an important part of learning, developing, debugging, and eventually unit testing. We can easily run our code on a simplified data object where the desired outcome is easy to determine to sanity-check our work, or we can use fake data to confirm our understanding of how a program will handle edge cases (like the diversity of ways different programs handle null values). Simple datasets can also be used as spines and scaffolds for more complex data wrangling tasks (e.g. joining event data onto a date spine).
In R, data.frame() and expand.grid() are go-to functions, coupled with vector generators like rep() and seq(). Python has many similar options.
For the simplest of datasets, we can manually write a few entries as with data.frame() in R. Here, we define series in a named dictionary where each dictionary key turns into a column name.
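For example:
import polars as pl
pl.DataFrame({
    'a': [1, 2, 3],        # each dictionary key becomes a column name
    'b': ['x', 'y', 'z']
})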
a | b |
---|---|
i64 | str |
1 | "x" |
2 | "y" |
3 | "z" |
If we need longer datasets, we can use helper functions in packages like numpy to generate the series. Methods like arange and linspace work similarly to R’s seq().
import polars as pl
import numpy as np
pl.DataFrame({
'a': np.arange(stop = 3),
'b': np.linspace(start = 9, stop = 24, num = 3)
})
a | b |
---|---|
i32 | f64 |
0 | 9.0 |
1 | 16.5 |
2 | 24.0 |
If we need groups in our sample data, we can use np.repeat() which works like R’s rep() with its each argument.
pl.DataFrame({
'a': np.repeat(np.arange(stop = 3), 2),
'b': np.linspace(start = 3, stop = 27, num = 6)
})
a | b |
---|---|
i32 | f64 |
0 | 3.0 |
0 | 7.8 |
1 | 12.6 |
1 | 17.4 |
2 | 22.2 |
2 | 27.0 |
Alternatively, for more control and succinct typing, we can create a nested dataset in polars and explode it out.
(
pl.DataFrame({
'a': [1, 2, 3],
'b': ["a b c", "d e f", "g h i"]
})
.with_columns(pl.col('b').str.split(" "))
.explode('b')
)
a | b |
---|---|
i64 | str |
1 | "a" |
1 | "b" |
1 | "c" |
2 | "d" |
2 | "e" |
2 | "f" |
3 | "g" |
3 | "h" |
3 | "i" |
Similarly, we could use what we’ve learned about polars list columns and list comprehensions.
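A sketch that builds a list column with a comprehension and then explodes it:
import polars as pl
(
    pl.DataFrame({
        'a': [1, 2, 3],
        'b': [[i * j for j in [1, 2, 3]] for i in [1, 2, 3]]   # one list per row
    })
    .explode('b')
)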
a | b |
---|---|
i64 | i64 |
1 | 1 |
1 | 2 |
1 | 3 |
2 | 2 |
2 | 4 |
2 | 6 |
3 | 3 |
3 | 6 |
3 | 9 |
In fact, multidimensional list comprehensions can be used to mimic R’s expand.grid() function.
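For instance, a minimal sketch with illustrative inputs:
sizes = ['S', 'M', 'L']
counts = [1, 2]
[(s, c) for s in sizes for c in counts]   # all six combinations, like expand.grid()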
R has a number of canonical datasets like iris built in to the core language. These can be easy to quickly grab for experimentation4. While base python doesn’t include such capabilities, many of the exact same or similar datasets can be found in seaborn.
seaborn.get_dataset_names() provides the list of available options. Below, we load the Palmer Penguins data and, if you wish, convert it from pandas to polars.
import seaborn as sns
import polars as pl
df_pd = sns.load_dataset('penguins')
df = pl.from_pandas(df_pd)
df.glimpse()
Rows: 344
Columns: 7
$ species <str> 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie'
$ island <str> 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen'
$ bill_length_mm <f64> 39.1, 39.5, 40.3, None, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0
$ bill_depth_mm <f64> 18.7, 17.4, 18.0, None, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2
$ flipper_length_mm <f64> 181.0, 186.0, 195.0, None, 193.0, 190.0, 181.0, 195.0, 193.0, 190.0
$ body_mass_g <f64> 3750.0, 3800.0, 3250.0, None, 3450.0, 3650.0, 3625.0, 4675.0, 3475.0, 4250.0
$ sex <str> 'Male', 'Female', 'Female', None, 'Female', 'Male', 'Female', 'Male', None, None
Sometimes, it can be useful to save objects as they existed in RAM in an active programming environment. R users may have experienced this if they’ve used .rds, .rda, or .Rdata files to save individual variables or their entire environment. These objects can often be faster to reload than plaintext, and they can better preserve information that may be lost in other formats (e.g. storing a dataframe in a way that preserves its datatypes versus writing to a CSV file5, or storing a complex object that can’t easily be reduced to plaintext, like a model with its training data, hyperparameters, and learned tree splits or weights for future predictions). This is called object serialization6.
Python has comparable capabilities in the pickle module. There aren’t really style points here, so I’ve not much to add beyond “this exists” and “read the documentation”. But, at a high level, it looks something like this:
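import pickle

results = {'accuracy': 0.92}   # an illustrative object worth saving

# serialize ("pickle") the object to disk
with open('results.pkl', 'wb') as f:
    pickle.dump(results, f)

# later, read it back ("unpickle")
with open('results.pkl', 'rb') as f:
    results_reloaded = pickle.load(f)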
I defined this odd scope to help limit the infinite number of workflow topics that could be included, like “how to write a function” or “how to source code from another script”↩︎
This is called “**kwargs” and works a bit like do.call() in base R. You can read more about it here.↩︎
Speaking of non-ergonomic things in R, the *apply() family is notoriously diverse in its number and order of arguments.↩︎
Particularly if you want to set wildly unrealistic expectations for the efficacy of k-means clustering, but I digress↩︎
And yes, you can and should use Parquet and then my example falls apart – but that’s not the point!↩︎
And, if you want to go incredibly deep here, check out this awesome post by Danielle Navarro.↩︎