Chapter 4 of R for Data Science, using polars and python

Author

Nick DiQuattro

Published

September 18, 2023

Introduction

I recently became very interested in the polars python package for data analysis. I appreciated that there seemed to be a philosophical viewpoint about how to interact with data first, then tooling built to enable that viewpoint. It reminded me of the thoughtfulness put into the tidyverse in R. I’m very attached to the tidy approach to data and have found it annoying to use clunkier libraries when working in python. With polars, there seems to be a true python analogue to the tidyverse (while staying pythonic).

To really compare the two frameworks, I’ve decided to translate relevant parts of Hadley Wickham’s excellent R for Data Science (2nd edition) to polars.

Translated libraries

It should be mentioned that there exist python libraries that attempt to directly replicate the tidyverse experience. The ones I know about are:

  • tidypolars - Most relevant to this project, replicates dplyr functions with a polars backend.

  • siuba - Another dplyr replication, not polars backed, but offers SQL translation of verbs.

  • rpolars - Going the other direction, the polars framework in R.

Personally, I prefer a complete context switch when moving between R and python. I admire the goals of the above packages, but there always seems to be something a little different that ends up confusing me.

Format

These posts will focus on code translations from R for Data Science with limited commentary. I would suggest reading Hadley’s book for context and finding the corresponding section here to see how to perform the equivalent operations with polars.

Translation for Chapter 4

I start with Chapter 4 as that’s when the book really gets into data manipulation. The visualization parts would also be interesting to translate someday, but I’m not sure which visualization library in python I like all that much.

4.1.1 Prerequisites

The nycflights13 dataset is made available as a python package from Michael Chow.

Note also that polars likes us to import helpers for column selection separately.

import polars as pl
import polars.selectors as cs

from nycflights13 import flights

pl.Config.set_tbl_rows(10)  # Make it easier to compare results to book.

flights = pl.from_pandas(flights)  # Convert to polars dataframe.

For future generations, I’m using the following version of polars.

pl.__version__
'0.19.3'

4.1.2 nycflights13

Printing a DataFrame shows the head and tail rows.

flights
shape: (336_776, 19)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str
2013 1 1 517.0 515 2.0 830.0 819 11.0 "UA" 1545 "N14228" "EWR" "IAH" 227.0 1400 5 15 "2013-01-01T10:…
2013 1 1 533.0 529 4.0 850.0 830 20.0 "UA" 1714 "N24211" "LGA" "IAH" 227.0 1416 5 29 "2013-01-01T10:…
2013 1 1 542.0 540 2.0 923.0 850 33.0 "AA" 1141 "N619AA" "JFK" "MIA" 160.0 1089 5 40 "2013-01-01T10:…
2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 "B6" 725 "N804JB" "JFK" "BQN" 183.0 1576 5 45 "2013-01-01T10:…
2013 1 1 554.0 600 -6.0 812.0 837 -25.0 "DL" 461 "N668DN" "LGA" "ATL" 116.0 762 6 0 "2013-01-01T11:…
2013 9 30 null 1455 null null 1634 null "9E" 3393 null "JFK" "DCA" null 213 14 55 "2013-09-30T18:…
2013 9 30 null 2200 null null 2312 null "9E" 3525 null "LGA" "SYR" null 198 22 0 "2013-10-01T02:…
2013 9 30 null 1210 null null 1330 null "MQ" 3461 "N535MQ" "LGA" "BNA" null 764 12 10 "2013-09-30T16:…
2013 9 30 null 1159 null null 1344 null "MQ" 3572 "N511MQ" "LGA" "CLE" null 419 11 59 "2013-09-30T15:…
2013 9 30 null 840 null null 1020 null "MQ" 3531 "N839MQ" "LGA" "RDU" null 431 8 40 "2013-09-30T12:…

So great to see glimpse() here too.

flights.glimpse()
Rows: 336776
Columns: 19
$ year           <i64> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013
$ month          <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ day            <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ dep_time       <f64> 517.0, 533.0, 542.0, 544.0, 554.0, 554.0, 555.0, 557.0, 557.0, 558.0
$ sched_dep_time <i64> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600
$ dep_delay      <f64> 2.0, 4.0, 2.0, -1.0, -6.0, -4.0, -5.0, -3.0, -3.0, -2.0
$ arr_time       <f64> 830.0, 850.0, 923.0, 1004.0, 812.0, 740.0, 913.0, 709.0, 838.0, 753.0
$ sched_arr_time <i64> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745
$ arr_delay      <f64> 11.0, 20.0, 33.0, -18.0, -25.0, 12.0, 19.0, -14.0, -8.0, 8.0
$ carrier        <str> 'UA', 'UA', 'AA', 'B6', 'DL', 'UA', 'B6', 'EV', 'B6', 'AA'
$ flight         <i64> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301
$ tailnum        <str> 'N14228', 'N24211', 'N619AA', 'N804JB', 'N668DN', 'N39463', 'N516JB', 'N829AS', 'N593JB', 'N3ALAA'
$ origin         <str> 'EWR', 'LGA', 'JFK', 'JFK', 'LGA', 'EWR', 'EWR', 'LGA', 'JFK', 'LGA'
$ dest           <str> 'IAH', 'IAH', 'MIA', 'BQN', 'ATL', 'ORD', 'FLL', 'IAD', 'MCO', 'ORD'
$ air_time       <f64> 227.0, 227.0, 160.0, 183.0, 116.0, 150.0, 158.0, 53.0, 140.0, 138.0
$ distance       <i64> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733
$ hour           <i64> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6
$ minute         <i64> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0
$ time_hour      <str> '2013-01-01T10:00:00Z', '2013-01-01T10:00:00Z', '2013-01-01T10:00:00Z', '2013-01-01T10:00:00Z', '2013-01-01T11:00:00Z', '2013-01-01T10:00:00Z', '2013-01-01T11:00:00Z', '2013-01-01T11:00:00Z', '2013-01-01T11:00:00Z', '2013-01-01T11:00:00Z'

4.1.3 dplyr polars basics

Instead of piping from one function to the next, we use method chaining with polars. Wrapping the whole chain in parentheses lets us insert vertical space to improve code readability. You can also put a \ at the end of each line.

(
    flights
    .filter(pl.col('dest') == "IAH")
    .group_by('year', 'month', 'day')
    .agg(pl.mean('arr_delay'))
)

4.2.1 filter()

Again, the same name! filter()ing with polars is more verbose than with dplyr, but that’s what happens without non-standard evaluation.

flights.filter(
    pl.col("dep_delay") > 120
)
shape: (9_723, 19)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str
2013 1 1 848.0 1835 853.0 1001.0 1950 851.0 "MQ" 3944 "N942MQ" "JFK" "BWI" 41.0 184 18 35 "2013-01-01T23:…
2013 1 1 957.0 733 144.0 1056.0 853 123.0 "UA" 856 "N534UA" "EWR" "BOS" 37.0 200 7 33 "2013-01-01T12:…
2013 1 1 1114.0 900 134.0 1447.0 1222 145.0 "UA" 1086 "N76502" "LGA" "IAH" 248.0 1416 9 0 "2013-01-01T14:…
2013 1 1 1540.0 1338 122.0 2020.0 1825 115.0 "B6" 705 "N570JB" "JFK" "SJU" 193.0 1598 13 38 "2013-01-01T18:…
2013 1 1 1815.0 1325 290.0 2120.0 1542 338.0 "EV" 4417 "N17185" "EWR" "OMA" 213.0 1134 13 25 "2013-01-01T18:…
2013 9 30 1823.0 1545 158.0 1934.0 1733 121.0 "9E" 3459 "N916XJ" "JFK" "BNA" 95.0 765 15 45 "2013-09-30T19:…
2013 9 30 1951.0 1649 182.0 2157.0 1903 174.0 "EV" 4294 "N13988" "EWR" "SAV" 95.0 708 16 49 "2013-09-30T20:…
2013 9 30 2053.0 1815 158.0 2310.0 2054 136.0 "EV" 5292 "N600QX" "EWR" "ATL" 91.0 746 18 15 "2013-09-30T22:…
2013 9 30 2159.0 1845 194.0 2344.0 2030 194.0 "9E" 3320 "N906XJ" "JFK" "BUF" 50.0 301 18 45 "2013-09-30T22:…
2013 9 30 2235.0 2001 154.0 59.0 2249 130.0 "B6" 1083 "N804JB" "JFK" "MCO" 123.0 944 20 1 "2013-10-01T00:…
flights.filter(
    (pl.col("month") == 1) & (pl.col("day") == 1)
)
shape: (842, 19)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str
2013 1 1 517.0 515 2.0 830.0 819 11.0 "UA" 1545 "N14228" "EWR" "IAH" 227.0 1400 5 15 "2013-01-01T10:…
2013 1 1 533.0 529 4.0 850.0 830 20.0 "UA" 1714 "N24211" "LGA" "IAH" 227.0 1416 5 29 "2013-01-01T10:…
2013 1 1 542.0 540 2.0 923.0 850 33.0 "AA" 1141 "N619AA" "JFK" "MIA" 160.0 1089 5 40 "2013-01-01T10:…
2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 "B6" 725 "N804JB" "JFK" "BQN" 183.0 1576 5 45 "2013-01-01T10:…
2013 1 1 554.0 600 -6.0 812.0 837 -25.0 "DL" 461 "N668DN" "LGA" "ATL" 116.0 762 6 0 "2013-01-01T11:…
2013 1 1 2356.0 2359 -3.0 425.0 437 -12.0 "B6" 727 "N588JB" "JFK" "BQN" 186.0 1576 23 59 "2013-01-02T04:…
2013 1 1 null 1630 null null 1815 null "EV" 4308 "N18120" "EWR" "RDU" null 416 16 30 "2013-01-01T21:…
2013 1 1 null 1935 null null 2240 null "AA" 791 "N3EHAA" "LGA" "DFW" null 1389 19 35 "2013-01-02T00:…
2013 1 1 null 1500 null null 1825 null "AA" 1925 "N3EVAA" "LGA" "MIA" null 1096 15 0 "2013-01-01T20:…
2013 1 1 null 600 null null 901 null "B6" 125 "N618JB" "JFK" "FLL" null 1069 6 0 "2013-01-01T11:…
flights.filter(
    (pl.col("month") == 1) | (pl.col("month") == 2)
)
shape: (51_955, 19)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str
2013 1 1 517.0 515 2.0 830.0 819 11.0 "UA" 1545 "N14228" "EWR" "IAH" 227.0 1400 5 15 "2013-01-01T10:…
2013 1 1 533.0 529 4.0 850.0 830 20.0 "UA" 1714 "N24211" "LGA" "IAH" 227.0 1416 5 29 "2013-01-01T10:…
2013 1 1 542.0 540 2.0 923.0 850 33.0 "AA" 1141 "N619AA" "JFK" "MIA" 160.0 1089 5 40 "2013-01-01T10:…
2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 "B6" 725 "N804JB" "JFK" "BQN" 183.0 1576 5 45 "2013-01-01T10:…
2013 1 1 554.0 600 -6.0 812.0 837 -25.0 "DL" 461 "N668DN" "LGA" "ATL" 116.0 762 6 0 "2013-01-01T11:…
2013 2 28 null 850 null null 1035 null "MQ" 4558 "N737MQ" "LGA" "CLE" null 419 8 50 "2013-02-28T13:…
2013 2 28 null 905 null null 1115 null "MQ" 4478 "N722MQ" "LGA" "DTW" null 502 9 5 "2013-02-28T14:…
2013 2 28 null 1115 null null 1310 null "MQ" 4485 "N725MQ" "LGA" "CMH" null 479 11 15 "2013-02-28T16:…
2013 2 28 null 830 null null 1205 null "UA" 1480 null "EWR" "SFO" null 2565 8 30 "2013-02-28T13:…
2013 2 28 null 840 null null 1147 null "UA" 443 null "JFK" "LAX" null 2475 8 40 "2013-02-28T13:…
flights.filter(
    pl.col("month").is_in([1, 2])
)
shape: (51_955, 19)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str
2013 1 1 517.0 515 2.0 830.0 819 11.0 "UA" 1545 "N14228" "EWR" "IAH" 227.0 1400 5 15 "2013-01-01T10:…
2013 1 1 533.0 529 4.0 850.0 830 20.0 "UA" 1714 "N24211" "LGA" "IAH" 227.0 1416 5 29 "2013-01-01T10:…
2013 1 1 542.0 540 2.0 923.0 850 33.0 "AA" 1141 "N619AA" "JFK" "MIA" 160.0 1089 5 40 "2013-01-01T10:…
2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 "B6" 725 "N804JB" "JFK" "BQN" 183.0 1576 5 45 "2013-01-01T10:…
2013 1 1 554.0 600 -6.0 812.0 837 -25.0 "DL" 461 "N668DN" "LGA" "ATL" 116.0 762 6 0 "2013-01-01T11:…
2013 2 28 null 850 null null 1035 null "MQ" 4558 "N737MQ" "LGA" "CLE" null 419 8 50 "2013-02-28T13:…
2013 2 28 null 905 null null 1115 null "MQ" 4478 "N722MQ" "LGA" "DTW" null 502 9 5 "2013-02-28T14:…
2013 2 28 null 1115 null null 1310 null "MQ" 4485 "N725MQ" "LGA" "CMH" null 479 11 15 "2013-02-28T16:…
2013 2 28 null 830 null null 1205 null "UA" 1480 null "EWR" "SFO" null 2565 8 30 "2013-02-28T13:…
2013 2 28 null 840 null null 1147 null "UA" 443 null "JFK" "LAX" null 2475 8 40 "2013-02-28T13:…

4.2.3 arrange() sort()

Sometimes you don’t need a polars expression; you can just give the column names as strings.

flights.sort("year", "month", "day", "dep_time")
shape: (336_776, 19)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str
2013 1 1 null 1630 null null 1815 null "EV" 4308 "N18120" "EWR" "RDU" null 416 16 30 "2013-01-01T21:…
2013 1 1 null 1935 null null 2240 null "AA" 791 "N3EHAA" "LGA" "DFW" null 1389 19 35 "2013-01-02T00:…
2013 1 1 null 1500 null null 1825 null "AA" 1925 "N3EVAA" "LGA" "MIA" null 1096 15 0 "2013-01-01T20:…
2013 1 1 null 600 null null 901 null "B6" 125 "N618JB" "JFK" "FLL" null 1069 6 0 "2013-01-01T11:…
2013 1 1 517.0 515 2.0 830.0 819 11.0 "UA" 1545 "N14228" "EWR" "IAH" 227.0 1400 5 15 "2013-01-01T10:…
2013 12 31 2321.0 2250 31.0 46.0 8 38.0 "B6" 2002 "N179JB" "JFK" "BUF" 66.0 301 22 50 "2014-01-01T03:…
2013 12 31 2328.0 2330 -2.0 412.0 409 3.0 "B6" 1389 "N651JB" "EWR" "SJU" 198.0 1608 23 30 "2014-01-01T04:…
2013 12 31 2332.0 2245 47.0 58.0 3 55.0 "B6" 486 "N334JB" "JFK" "ROC" 60.0 264 22 45 "2014-01-01T03:…
2013 12 31 2355.0 2359 -4.0 430.0 440 -10.0 "B6" 1503 "N509JB" "JFK" "SJU" 195.0 1598 23 59 "2014-01-01T04:…
2013 12 31 2356.0 2359 -3.0 436.0 445 -9.0 "B6" 745 "N665JB" "JFK" "PSE" 200.0 1617 23 59 "2014-01-01T04:…
flights.sort("dep_delay", descending=True)
shape: (336_776, 19)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str
2013 1 9 641.0 900 1301.0 1242.0 1530 1272.0 "HA" 51 "N384HA" "JFK" "HNL" 640.0 4983 9 0 "2013-01-09T14:…
2013 6 15 1432.0 1935 1137.0 1607.0 2120 1127.0 "MQ" 3535 "N504MQ" "JFK" "CMH" 74.0 483 19 35 "2013-06-15T23:…
2013 1 10 1121.0 1635 1126.0 1239.0 1810 1109.0 "MQ" 3695 "N517MQ" "EWR" "ORD" 111.0 719 16 35 "2013-01-10T21:…
2013 9 20 1139.0 1845 1014.0 1457.0 2210 1007.0 "AA" 177 "N338AA" "JFK" "SFO" 354.0 2586 18 45 "2013-09-20T22:…
2013 7 22 845.0 1600 1005.0 1044.0 1815 989.0 "MQ" 3075 "N665MQ" "JFK" "CVG" 96.0 589 16 0 "2013-07-22T20:…
2013 1 2 null 1540 null null 1747 null "EV" 4352 "N10575" "EWR" "CVG" null 569 15 40 "2013-01-02T20:…
2013 1 1 null 600 null null 901 null "B6" 125 "N618JB" "JFK" "FLL" null 1069 6 0 "2013-01-01T11:…
2013 1 1 null 1500 null null 1825 null "AA" 1925 "N3EVAA" "LGA" "MIA" null 1096 15 0 "2013-01-01T20:…
2013 1 1 null 1935 null null 2240 null "AA" 791 "N3EHAA" "LGA" "DFW" null 1389 19 35 "2013-01-02T00:…
2013 1 1 null 1630 null null 1815 null "EV" 4308 "N18120" "EWR" "RDU" null 416 16 30 "2013-01-01T21:…

4.2.4 distinct() unique()

flights.unique()
shape: (336_776, 19)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str
2013 1 1 557.0 600 -3.0 838.0 846 -8.0 "B6" 79 "N593JB" "JFK" "MCO" 140.0 944 6 0 "2013-01-01T11:…
2013 1 1 558.0 600 -2.0 924.0 917 7.0 "UA" 194 "N29129" "JFK" "LAX" 345.0 2475 6 0 "2013-01-01T11:…
2013 1 1 607.0 607 0.0 858.0 915 -17.0 "UA" 1077 "N53442" "EWR" "MIA" 157.0 1085 6 7 "2013-01-01T11:…
2013 1 1 615.0 615 0.0 833.0 842 -9.0 "DL" 575 "N326NB" "EWR" "ATL" 120.0 746 6 15 "2013-01-01T11:…
2013 1 1 628.0 630 -2.0 1137.0 1140 -3.0 "AA" 413 "N3BAAA" "JFK" "SJU" 192.0 1598 6 30 "2013-01-01T11:…
2013 9 30 2104.0 2019 45.0 2304.0 2240 24.0 "EV" 4333 "N11109" "EWR" "TUL" 154.0 1215 20 19 "2013-10-01T00:…
2013 9 30 2104.0 2110 -6.0 2323.0 2341 -18.0 "EV" 5811 "N17560" "EWR" "JAX" 112.0 820 21 10 "2013-10-01T01:…
2013 9 30 2207.0 2140 27.0 2257.0 2250 7.0 "MQ" 3660 "N532MQ" "LGA" "BNA" 97.0 764 21 40 "2013-10-01T01:…
2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 "B6" 2002 "N281JB" "JFK" "BUF" 52.0 301 22 50 "2013-10-01T02:…
2013 9 30 null 840 null null 1020 null "MQ" 3531 "N839MQ" "LGA" "RDU" null 431 8 40 "2013-09-30T12:…

The default for unique() with target columns is equivalent to distinct(.keep_all = TRUE).

flights.unique(["origin", "dest"])
shape: (224, 19)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str
2013 1 1 554.0 600 -6.0 812.0 837 -25.0 "DL" 461 "N668DN" "LGA" "ATL" 116.0 762 6 0 "2013-01-01T11:…
2013 1 1 558.0 600 -2.0 853.0 856 -3.0 "B6" 71 "N657JB" "JFK" "TPA" 158.0 1005 6 0 "2013-01-01T11:…
2013 1 1 558.0 600 -2.0 923.0 937 -14.0 "UA" 1124 "N53441" "EWR" "SFO" 361.0 2565 6 0 "2013-01-01T11:…
2013 1 1 559.0 600 -1.0 941.0 910 31.0 "AA" 707 "N3DUAA" "LGA" "DFW" 257.0 1389 6 0 "2013-01-01T11:…
2013 1 1 559.0 559 0.0 702.0 706 -4.0 "B6" 1806 "N708JB" "JFK" "BOS" 44.0 187 5 59 "2013-01-01T10:…
2013 1 5 1620.0 1625 -5.0 1749.0 1819 -30.0 "EV" 5085 "N611QX" "LGA" "GSO" 71.0 461 16 25 "2013-01-05T21:…
2013 10 1 916.0 925 -9.0 1155.0 1207 -12.0 "9E" 4065 "N8903A" "LGA" "SDF" 95.0 659 9 25 "2013-10-01T13:…
2013 10 1 955.0 1000 -5.0 1156.0 1234 -38.0 "9E" 3574 "N913XJ" "LGA" "IND" 93.0 660 10 0 "2013-10-01T14:…
2013 10 3 1414.0 1350 24.0 1514.0 1453 21.0 "B6" 1338 "N368JB" "JFK" "MVY" 36.0 173 13 50 "2013-10-03T17:…
2013 12 1 1241.0 1215 26.0 1431.0 1431 0.0 "EV" 3264 "N16976" "EWR" "SBN" 94.0 637 12 15 "2013-12-01T17:…

Note though that by default unique() will choose any of the duplicate rows whereas distinct() returns the first of the duplicates.

You can tell unique() to keep the first, but according to the documentation it’s less performant.

flights.unique(["origin", "dest"], keep="first")
shape: (224, 19)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str
2013 1 1 558.0 600 -2.0 924.0 917 7.0 "UA" 194 "N29129" "JFK" "LAX" 345.0 2475 6 0 "2013-01-01T11:…
2013 1 1 602.0 610 -8.0 812.0 820 -8.0 "DL" 1919 "N971DL" "LGA" "MSP" 170.0 1020 6 10 "2013-01-01T11:…
2013 1 1 606.0 610 -4.0 858.0 910 -12.0 "AA" 1895 "N633AA" "EWR" "MIA" 152.0 1085 6 10 "2013-01-01T11:…
2013 1 1 629.0 630 -1.0 824.0 833 -9.0 "US" 1019 "N426US" "EWR" "CLT" 91.0 529 6 30 "2013-01-01T11:…
2013 1 1 652.0 655 -3.0 932.0 921 11.0 "B6" 117 "N178JB" "JFK" "MSY" 191.0 1182 6 55 "2013-01-01T11:…
2013 10 1 1543.0 1545 -2.0 1702.0 1721 -19.0 "EV" 5293 "N398CA" "LGA" "ORF" 46.0 296 15 45 "2013-10-01T19:…
2013 10 1 1658.0 1700 -2.0 2002.0 1955 7.0 "AA" 211 "N3FVAA" "JFK" "IAH" 214.0 1417 17 0 "2013-10-01T21:…
2013 10 3 1414.0 1350 24.0 1514.0 1453 21.0 "B6" 1338 "N368JB" "JFK" "MVY" 36.0 173 13 50 "2013-10-03T17:…
2013 12 19 1056.0 1103 -7.0 1210.0 1219 -9.0 "EV" 5252 "N848AS" "LGA" "MHT" 39.0 195 11 3 "2013-12-19T16:…
2013 7 5 1355.0 1400 -5.0 1550.0 1620 -30.0 "EV" 4137 "N11164" "EWR" "TVC" 93.0 644 14 0 "2013-07-05T18:…

To replicate the default distinct() behavior, you select() those columns first.

flights.select("origin", "dest").unique()
shape: (224, 2)
origin dest
str str
"LGA" "ORD"
"EWR" "PBI"
"LGA" "MSP"
"JFK" "RSW"
"JFK" "FLL"
"LGA" "HOU"
"LGA" "MSN"
"JFK" "MCI"
"LGA" "SBN"
"LGA" "CHO"

As far as I could tell, there isn’t a convenient count() equivalent in polars. It’s not too bad to replicate though.

(
    flights
    .group_by("origin", "dest")
    .count()
    .sort('count', descending=True)
)
shape: (224, 3)
origin dest count
str str u32
"JFK" "LAX" 11262
"LGA" "ATL" 10263
"LGA" "ORD" 8857
"JFK" "SFO" 8204
"LGA" "CLT" 6168
"JFK" "STL" 1
"JFK" "BHM" 1
"JFK" "MEM" 1
"LGA" "LEX" 1
"EWR" "LGA" 1

4.3.1 mutate() with_columns()

Again we have to be a bit more verbose in polars. At least we don’t need to repeat the DataFrame name over and over!

flights.with_columns(
    gain=pl.col("dep_delay") - pl.col("arr_delay"),
    speed=pl.col("distance") / pl.col("air_time") * 60
)
shape: (336_776, 21)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour gain speed
i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str f64 f64
2013 1 1 517.0 515 2.0 830.0 819 11.0 "UA" 1545 "N14228" "EWR" "IAH" 227.0 1400 5 15 "2013-01-01T10:… -9.0 370.044053
2013 1 1 533.0 529 4.0 850.0 830 20.0 "UA" 1714 "N24211" "LGA" "IAH" 227.0 1416 5 29 "2013-01-01T10:… -16.0 374.273128
2013 1 1 542.0 540 2.0 923.0 850 33.0 "AA" 1141 "N619AA" "JFK" "MIA" 160.0 1089 5 40 "2013-01-01T10:… -31.0 408.375
2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 "B6" 725 "N804JB" "JFK" "BQN" 183.0 1576 5 45 "2013-01-01T10:… 17.0 516.721311
2013 1 1 554.0 600 -6.0 812.0 837 -25.0 "DL" 461 "N668DN" "LGA" "ATL" 116.0 762 6 0 "2013-01-01T11:… 19.0 394.137931
2013 9 30 null 1455 null null 1634 null "9E" 3393 null "JFK" "DCA" null 213 14 55 "2013-09-30T18:… null null
2013 9 30 null 2200 null null 2312 null "9E" 3525 null "LGA" "SYR" null 198 22 0 "2013-10-01T02:… null null
2013 9 30 null 1210 null null 1330 null "MQ" 3461 "N535MQ" "LGA" "BNA" null 764 12 10 "2013-09-30T16:… null null
2013 9 30 null 1159 null null 1344 null "MQ" 3572 "N511MQ" "LGA" "CLE" null 419 11 59 "2013-09-30T15:… null null
2013 9 30 null 840 null null 1020 null "MQ" 3531 "N839MQ" "LGA" "RDU" null 431 8 40 "2013-09-30T12:… null null

From what I can tell, there aren’t any .before or .after equivalents in polars. We can approximate the behavior using .select(), which behaves similarly to transmute() in dplyr. We also have to switch to using the .alias() method for naming our new columns so that we can tack all the original columns onto the end.

flights.select(
    (pl.col("dep_delay") - pl.col("arr_delay")).alias("gain"),
    (pl.col("distance") / pl.col("air_time") * 60).alias("speed"),
    pl.col("*")
)
shape: (336_776, 21)
gain speed year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
f64 f64 i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str
-9.0 370.044053 2013 1 1 517.0 515 2.0 830.0 819 11.0 "UA" 1545 "N14228" "EWR" "IAH" 227.0 1400 5 15 "2013-01-01T10:…
-16.0 374.273128 2013 1 1 533.0 529 4.0 850.0 830 20.0 "UA" 1714 "N24211" "LGA" "IAH" 227.0 1416 5 29 "2013-01-01T10:…
-31.0 408.375 2013 1 1 542.0 540 2.0 923.0 850 33.0 "AA" 1141 "N619AA" "JFK" "MIA" 160.0 1089 5 40 "2013-01-01T10:…
17.0 516.721311 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 "B6" 725 "N804JB" "JFK" "BQN" 183.0 1576 5 45 "2013-01-01T10:…
19.0 394.137931 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 "DL" 461 "N668DN" "LGA" "ATL" 116.0 762 6 0 "2013-01-01T11:…
null null 2013 9 30 null 1455 null null 1634 null "9E" 3393 null "JFK" "DCA" null 213 14 55 "2013-09-30T18:…
null null 2013 9 30 null 2200 null null 2312 null "9E" 3525 null "LGA" "SYR" null 198 22 0 "2013-10-01T02:…
null null 2013 9 30 null 1210 null null 1330 null "MQ" 3461 "N535MQ" "LGA" "BNA" null 764 12 10 "2013-09-30T16:…
null null 2013 9 30 null 1159 null null 1344 null "MQ" 3572 "N511MQ" "LGA" "CLE" null 419 11 59 "2013-09-30T15:…
null null 2013 9 30 null 840 null null 1020 null "MQ" 3531 "N839MQ" "LGA" "RDU" null 431 8 40 "2013-09-30T12:…

The above works well for putting new columns at the start or end of a DataFrame. There doesn’t seem to be an easy way to place them next to a desired column, though.

4.3.2 select()

Another shared name here, with similar functionality!

flights.select("year", "month", "day")
shape: (336_776, 3)
year month day
i64 i64 i64
2013 1 1
2013 1 1
2013 1 1
2013 1 1
2013 1 1
2013 9 30
2013 9 30
2013 9 30
2013 9 30
2013 9 30

I couldn’t find a simple way of selecting a range of columns based on their name. It’s split up into two parts here.

tcols = flights.columns[flights.find_idx_by_name("year"):flights.find_idx_by_name("day") + 1]
flights.select(tcols)
shape: (336_776, 3)
year month day
i64 i64 i64
2013 1 1
2013 1 1
2013 1 1
2013 1 1
2013 1 1
2013 9 30
2013 9 30
2013 9 30
2013 9 30
2013 9 30

To select everything but the range of columns we invoke a selector.

flights.select(~cs.by_name(tcols))
shape: (336_776, 16)
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str
517.0 515 2.0 830.0 819 11.0 "UA" 1545 "N14228" "EWR" "IAH" 227.0 1400 5 15 "2013-01-01T10:…
533.0 529 4.0 850.0 830 20.0 "UA" 1714 "N24211" "LGA" "IAH" 227.0 1416 5 29 "2013-01-01T10:…
542.0 540 2.0 923.0 850 33.0 "AA" 1141 "N619AA" "JFK" "MIA" 160.0 1089 5 40 "2013-01-01T10:…
544.0 545 -1.0 1004.0 1022 -18.0 "B6" 725 "N804JB" "JFK" "BQN" 183.0 1576 5 45 "2013-01-01T10:…
554.0 600 -6.0 812.0 837 -25.0 "DL" 461 "N668DN" "LGA" "ATL" 116.0 762 6 0 "2013-01-01T11:…
null 1455 null null 1634 null "9E" 3393 null "JFK" "DCA" null 213 14 55 "2013-09-30T18:…
null 2200 null null 2312 null "9E" 3525 null "LGA" "SYR" null 198 22 0 "2013-10-01T02:…
null 1210 null null 1330 null "MQ" 3461 "N535MQ" "LGA" "BNA" null 764 12 10 "2013-09-30T16:…
null 1159 null null 1344 null "MQ" 3572 "N511MQ" "LGA" "CLE" null 419 11 59 "2013-09-30T15:…
null 840 null null 1020 null "MQ" 3531 "N839MQ" "LGA" "RDU" null 431 8 40 "2013-09-30T12:…
flights.select(cs.string())  # same as using `where(is.character)`
shape: (336_776, 5)
carrier tailnum origin dest time_hour
str str str str str
"UA" "N14228" "EWR" "IAH" "2013-01-01T10:…
"UA" "N24211" "LGA" "IAH" "2013-01-01T10:…
"AA" "N619AA" "JFK" "MIA" "2013-01-01T10:…
"B6" "N804JB" "JFK" "BQN" "2013-01-01T10:…
"DL" "N668DN" "LGA" "ATL" "2013-01-01T11:…
"9E" null "JFK" "DCA" "2013-09-30T18:…
"9E" null "LGA" "SYR" "2013-10-01T02:…
"MQ" "N535MQ" "LGA" "BNA" "2013-09-30T16:…
"MQ" "N511MQ" "LGA" "CLE" "2013-09-30T15:…
"MQ" "N839MQ" "LGA" "RDU" "2013-09-30T12:…

Many of the selection helpers are the same in polars:

  • cs.starts_with("abc")
  • cs.ends_with("xyz")
  • cs.contains("ijk")

No equivalent to num_range() that I could find, however.

Renaming and selection at the same time is possible though.

flights.select(tail_num = "tailnum")
shape: (336_776, 1)
tail_num
str
"N14228"
"N24211"
"N619AA"
"N804JB"
"N668DN"
null
null
"N535MQ"
"N511MQ"
"N839MQ"

4.3.3 rename()

rename() in polars takes a dictionary.

flights.rename({"tailnum": "tail_num"})
shape: (336_776, 19)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tail_num origin dest air_time distance hour minute time_hour
i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str str f64 i64 i64 i64 str
2013 1 1 517.0 515 2.0 830.0 819 11.0 "UA" 1545 "N14228" "EWR" "IAH" 227.0 1400 5 15 "2013-01-01T10:…
2013 1 1 533.0 529 4.0 850.0 830 20.0 "UA" 1714 "N24211" "LGA" "IAH" 227.0 1416 5 29 "2013-01-01T10:…
2013 1 1 542.0 540 2.0 923.0 850 33.0 "AA" 1141 "N619AA" "JFK" "MIA" 160.0 1089 5 40 "2013-01-01T10:…
2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 "B6" 725 "N804JB" "JFK" "BQN" 183.0 1576 5 45 "2013-01-01T10:…
2013 1 1 554.0 600 -6.0 812.0 837 -25.0 "DL" 461 "N668DN" "LGA" "ATL" 116.0 762 6 0 "2013-01-01T11:…
2013 9 30 null 1455 null null 1634 null "9E" 3393 null "JFK" "DCA" null 213 14 55 "2013-09-30T18:…
2013 9 30 null 2200 null null 2312 null "9E" 3525 null "LGA" "SYR" null 198 22 0 "2013-10-01T02:…
2013 9 30 null 1210 null null 1330 null "MQ" 3461 "N535MQ" "LGA" "BNA" null 764 12 10 "2013-09-30T16:…
2013 9 30 null 1159 null null 1344 null "MQ" 3572 "N511MQ" "LGA" "CLE" null 419 11 59 "2013-09-30T15:…
2013 9 30 null 840 null null 1020 null "MQ" 3531 "N839MQ" "LGA" "RDU" null 431 8 40 "2013-09-30T12:…

4.3.4 relocate() ???

I can’t find a relocate() equivalent in polars. I think you’d have to get clever with .select() and pl.all() to make it happen.

4.5.1 group_by()

The workhorse.

flights.group_by("month")
<polars.dataframe.group_by.GroupBy at 0x105d2f790>

4.5.2 summarize() agg()

A thing to note about polars is that most of the aggregating functions ignore missing data by default. Coming from R, you’d expect NA results unless you explicitly ignore NAs.

(
    flights
    .group_by("month")
    .agg(
        avg_delay = pl.mean("dep_delay")  # Ignores missing by default
    )
    .sort("month")  # To compare to book
)
shape: (12, 2)
month avg_delay
i64 f64
1 10.036665
2 10.816843
3 13.227076
4 13.938038
5 12.986859
8 12.61104
9 6.722476
10 6.243988
11 5.435362
12 16.576688
(
    flights
    .group_by("month")
    .agg(
        avg_delay = pl.mean("dep_delay"),
        n = pl.count()
    )
    .sort("month")  # To compare to book.
)
shape: (12, 3)
month avg_delay n
i64 f64 u32
1 10.036665 27004
2 10.816843 24951
3 13.227076 28834
4 13.938038 28330
5 12.986859 28796
8 12.61104 29327
9 6.722476 27574
10 6.243988 28889
11 5.435362 27268
12 16.576688 28135

4.5.3 The slice_ window functions

This section is the largest departure so far from dplyr. In polars we use window functions (as in SQL) to perform the kinds of computations implemented via the slice_ functions.

(
    flights 
    .filter(
        pl.col("arr_delay") == pl.col("arr_delay").max().over("dest")
    )
    .select(
        pl.col("dest"), 
        pl.col("*").exclude("dest")  # Example of getting around a lack of `relocate()`.
    )
    .sort("dest")  # For book comparison
)
shape: (107, 19)
dest year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin air_time distance hour minute time_hour
str i64 i64 i64 f64 i64 f64 f64 i64 f64 str i64 str str f64 i64 i64 i64 str
"ABQ" 2013 7 22 2145.0 2007 98.0 132.0 2259 153.0 "B6" 1505 "N784JB" "JFK" 259.0 1826 20 7 "2013-07-23T00:…
"ACK" 2013 7 23 1139.0 800 219.0 1250.0 909 221.0 "B6" 1491 "N192JB" "JFK" 44.0 199 8 0 "2013-07-23T12:…
"ALB" 2013 1 25 123.0 2000 323.0 229.0 2101 328.0 "EV" 4309 "N13908" "EWR" 30.0 143 20 0 "2013-01-26T01:…
"ANC" 2013 8 17 1740.0 1625 75.0 2042.0 2003 39.0 "UA" 887 "N528UA" "EWR" 404.0 3370 16 25 "2013-08-17T20:…
"ATL" 2013 7 22 2257.0 759 898.0 121.0 1026 895.0 "DL" 2047 "N6716C" "LGA" 109.0 762 7 59 "2013-07-22T11:…
"TPA" 2013 4 10 1100.0 1900 960.0 1342.0 2211 931.0 "DL" 2391 "N959DL" "JFK" 139.0 1005 19 0 "2013-04-10T23:…
"TUL" 2013 1 20 2347.0 1936 251.0 230.0 2208 262.0 "EV" 4333 "N14177" "EWR" 208.0 1215 19 36 "2013-01-21T00:…
"TVC" 2013 8 10 1120.0 730 230.0 1323.0 943 220.0 "EV" 5144 "N718EV" "LGA" 96.0 655 7 30 "2013-08-10T11:…
"TYS" 2013 4 12 51.0 2000 291.0 250.0 2209 281.0 "EV" 3826 "N15986" "EWR" 103.0 631 20 0 "2013-04-13T00:…
"XNA" 2013 3 8 1311.0 822 289.0 1615.0 1045 330.0 "EV" 4140 "N18556" "EWR" 168.0 1131 8 22 "2013-03-08T13:…

4.5.4 Grouping by multiple variables

daily = flights.group_by("year", "month", "day")
daily.count()
shape: (365, 4)
year month day count
i64 i64 i64 u32
2013 1 7 933
2013 1 14 928
2013 1 27 823
2013 10 10 994
2013 10 17 995
2013 9 17 961
2013 9 20 994
2013 9 23 993
2013 9 26 996
2013 9 29 914

4.5.5 Ungrouping

No takebacks on grouping in polars. However, groupings do not persist after aggregation.

type(daily)
polars.dataframe.group_by.GroupBy
type(daily.count())
polars.dataframe.frame.DataFrame

4.5.6 .by

Nothing to do here.

4.6 Case study: aggregates and sample size

For fun, let’s replicate this code as well.

from teqniqly.lahman_datasets import LahmanDatasets

ld = LahmanDatasets()
ld.load()

batting_loc = next(filter(lambda x: x.endswith("Batting"), ld.dataframe_names))
batters = (
    pl.from_pandas(ld[batting_loc])
    .group_by("playerID")
    .agg(
        performance = pl.col("H").sum() / pl.col("AB").sum(),
        n = pl.col("AB").sum()
    )
)

batters
shape: (20_469, 3)
playerID performance n
str f64 i64
"martijd01" 0.162162 37
"peterja01" 0.230807 2019
"clintlo01" 0.247097 2153
"smithca02" NaN 0
"dumovni01" 0.241379 29
"wardja01" 0.163265 49
"clairda01" 0.142857 7
"ogdenja01" 0.2 170
"bilkost01" 0.248562 1738
"odayha01" 0.189718 817

Plotting isn’t really the focus of this project, but let’s try replicating this simple chart. This also allows us to try out the .pipe() method of a polars DataFrame.

import altair as alt
alt.data_transformers.enable("vegafusion")

chart = (
    batters
    .filter(pl.col("n") > 100)
    .pipe(alt.Chart)
    .encode(x="n", y="performance")
    .mark_point(color="black", opacity=1 / 10)
    .properties(width='container')
)

chart + chart.transform_loess('n', 'performance').mark_line()