import polars as pl
import polars.selectors as cs
from nycflights13 import flights
10) # Make it easier to compare results to book.
pl.Config.set_tbl_rows(
= pl.from_pandas(flights) # Convert to polars dataframe. flights
Introduction
I recently started to become very interested in the polars
python package for data analysis. I appreciated that there seemed to be a philosophical viewpoint about how to interact with data first, then tooling built to enable that viewpoint. It reminded me of the thoughtfulness put into the tidyverse in R. I’m very attached to the tidy approach to data and have found it annoying when needing to use clunkier libraries when working in python. With Polars, there seems to be a true python analogue to the tidyverse (while staying pythonic).
To really compare the two frameworks, I’ve decided to translate relevant parts of Hadley’s Wickham’s excellent R for Data Science (2nd edition) to polars
.
Translated libraries
It should be mentioned that there exist python libraries that attempt to directly replicate the tidyverse experience. The ones I know about are:
tidypolars - Most relevant to this project, replicates
dplyr
functions with apolars
backend.suiba - Another dplyr replication, not
polars
backed, but offers SQL translation of verbs.rpolars - Going the other direction, the
polars
framework in R.
For me, I prefer a complete context switch when moving between R and python. I admire the goals of the above packages, but there always seems to be something a little different that ends up confuses me.
Format
These posts will focus on code translations from R for Data Science with limited commentary. I would suggest reading Hadley’s book for context and finding the corresponding section here to see how to perform the equivalent operations with polars.
Translation for Chapter 4
I start with Chapter 4 as that’s when the book really gets into data manipulation. The visualization parts would also be interesting to translate someday, but I’m not sure which visualization library in python I like all that much.
4.1.1 Prerequisites
The nycflights13
dataset is made available as a python package from Michael Chow.
Note also that polars
likes us to import helpers for column selection separately.
For future generations, I’m using the following version of polars
.
pl.__version__
'0.19.3'
4.1.2 nycflights13
Printing a DataFrame shows the head and tail rows.
flights
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str |
2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | "UA" | 1545 | "N14228" | "EWR" | "IAH" | 227.0 | 1400 | 5 | 15 | "2013-01-01T10:… |
2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | "UA" | 1714 | "N24211" | "LGA" | "IAH" | 227.0 | 1416 | 5 | 29 | "2013-01-01T10:… |
2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | "AA" | 1141 | "N619AA" | "JFK" | "MIA" | 160.0 | 1089 | 5 | 40 | "2013-01-01T10:… |
2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | "B6" | 725 | "N804JB" | "JFK" | "BQN" | 183.0 | 1576 | 5 | 45 | "2013-01-01T10:… |
2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | "DL" | 461 | "N668DN" | "LGA" | "ATL" | 116.0 | 762 | 6 | 0 | "2013-01-01T11:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2013 | 9 | 30 | null | 1455 | null | null | 1634 | null | "9E" | 3393 | null | "JFK" | "DCA" | null | 213 | 14 | 55 | "2013-09-30T18:… |
2013 | 9 | 30 | null | 2200 | null | null | 2312 | null | "9E" | 3525 | null | "LGA" | "SYR" | null | 198 | 22 | 0 | "2013-10-01T02:… |
2013 | 9 | 30 | null | 1210 | null | null | 1330 | null | "MQ" | 3461 | "N535MQ" | "LGA" | "BNA" | null | 764 | 12 | 10 | "2013-09-30T16:… |
2013 | 9 | 30 | null | 1159 | null | null | 1344 | null | "MQ" | 3572 | "N511MQ" | "LGA" | "CLE" | null | 419 | 11 | 59 | "2013-09-30T15:… |
2013 | 9 | 30 | null | 840 | null | null | 1020 | null | "MQ" | 3531 | "N839MQ" | "LGA" | "RDU" | null | 431 | 8 | 40 | "2013-09-30T12:… |
So great to see glimpse()
here too.
flights.glimpse()
Rows: 336776
Columns: 19
$ year <i64> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013
$ month <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ day <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ dep_time <f64> 517.0, 533.0, 542.0, 544.0, 554.0, 554.0, 555.0, 557.0, 557.0, 558.0
$ sched_dep_time <i64> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600
$ dep_delay <f64> 2.0, 4.0, 2.0, -1.0, -6.0, -4.0, -5.0, -3.0, -3.0, -2.0
$ arr_time <f64> 830.0, 850.0, 923.0, 1004.0, 812.0, 740.0, 913.0, 709.0, 838.0, 753.0
$ sched_arr_time <i64> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745
$ arr_delay <f64> 11.0, 20.0, 33.0, -18.0, -25.0, 12.0, 19.0, -14.0, -8.0, 8.0
$ carrier <str> 'UA', 'UA', 'AA', 'B6', 'DL', 'UA', 'B6', 'EV', 'B6', 'AA'
$ flight <i64> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301
$ tailnum <str> 'N14228', 'N24211', 'N619AA', 'N804JB', 'N668DN', 'N39463', 'N516JB', 'N829AS', 'N593JB', 'N3ALAA'
$ origin <str> 'EWR', 'LGA', 'JFK', 'JFK', 'LGA', 'EWR', 'EWR', 'LGA', 'JFK', 'LGA'
$ dest <str> 'IAH', 'IAH', 'MIA', 'BQN', 'ATL', 'ORD', 'FLL', 'IAD', 'MCO', 'ORD'
$ air_time <f64> 227.0, 227.0, 160.0, 183.0, 116.0, 150.0, 158.0, 53.0, 140.0, 138.0
$ distance <i64> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733
$ hour <i64> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6
$ minute <i64> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0
$ time_hour <str> '2013-01-01T10:00:00Z', '2013-01-01T10:00:00Z', '2013-01-01T10:00:00Z', '2013-01-01T10:00:00Z', '2013-01-01T11:00:00Z', '2013-01-01T10:00:00Z', '2013-01-01T11:00:00Z', '2013-01-01T11:00:00Z', '2013-01-01T11:00:00Z', '2013-01-01T11:00:00Z'
4.1.3 dplyr polars basics
Instead of piping from one function to the next, we use method chaining with polars
. Wrapping the whole chain in a parenthesis allows us to insert vertical space to improve code readability. You can also put a \
at the end of each line.
(
flightsfilter(pl.col('dest') == "IAH")
.'year', 'month', 'day')
.group_by('arr_delay'))
.agg(pl.mean( )
4.2.1 filter()
Again, the same name! filter()
ing with polars is more verbose than with dplyr, but that’s what happens without non-standard evaluation.
filter(
flights."dep_delay") > 120
pl.col( )
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str |
2013 | 1 | 1 | 848.0 | 1835 | 853.0 | 1001.0 | 1950 | 851.0 | "MQ" | 3944 | "N942MQ" | "JFK" | "BWI" | 41.0 | 184 | 18 | 35 | "2013-01-01T23:… |
2013 | 1 | 1 | 957.0 | 733 | 144.0 | 1056.0 | 853 | 123.0 | "UA" | 856 | "N534UA" | "EWR" | "BOS" | 37.0 | 200 | 7 | 33 | "2013-01-01T12:… |
2013 | 1 | 1 | 1114.0 | 900 | 134.0 | 1447.0 | 1222 | 145.0 | "UA" | 1086 | "N76502" | "LGA" | "IAH" | 248.0 | 1416 | 9 | 0 | "2013-01-01T14:… |
2013 | 1 | 1 | 1540.0 | 1338 | 122.0 | 2020.0 | 1825 | 115.0 | "B6" | 705 | "N570JB" | "JFK" | "SJU" | 193.0 | 1598 | 13 | 38 | "2013-01-01T18:… |
2013 | 1 | 1 | 1815.0 | 1325 | 290.0 | 2120.0 | 1542 | 338.0 | "EV" | 4417 | "N17185" | "EWR" | "OMA" | 213.0 | 1134 | 13 | 25 | "2013-01-01T18:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2013 | 9 | 30 | 1823.0 | 1545 | 158.0 | 1934.0 | 1733 | 121.0 | "9E" | 3459 | "N916XJ" | "JFK" | "BNA" | 95.0 | 765 | 15 | 45 | "2013-09-30T19:… |
2013 | 9 | 30 | 1951.0 | 1649 | 182.0 | 2157.0 | 1903 | 174.0 | "EV" | 4294 | "N13988" | "EWR" | "SAV" | 95.0 | 708 | 16 | 49 | "2013-09-30T20:… |
2013 | 9 | 30 | 2053.0 | 1815 | 158.0 | 2310.0 | 2054 | 136.0 | "EV" | 5292 | "N600QX" | "EWR" | "ATL" | 91.0 | 746 | 18 | 15 | "2013-09-30T22:… |
2013 | 9 | 30 | 2159.0 | 1845 | 194.0 | 2344.0 | 2030 | 194.0 | "9E" | 3320 | "N906XJ" | "JFK" | "BUF" | 50.0 | 301 | 18 | 45 | "2013-09-30T22:… |
2013 | 9 | 30 | 2235.0 | 2001 | 154.0 | 59.0 | 2249 | 130.0 | "B6" | 1083 | "N804JB" | "JFK" | "MCO" | 123.0 | 944 | 20 | 1 | "2013-10-01T00:… |
filter(
flights."month") == 1) & (pl.col("day") == 1)
(pl.col( )
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str |
2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | "UA" | 1545 | "N14228" | "EWR" | "IAH" | 227.0 | 1400 | 5 | 15 | "2013-01-01T10:… |
2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | "UA" | 1714 | "N24211" | "LGA" | "IAH" | 227.0 | 1416 | 5 | 29 | "2013-01-01T10:… |
2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | "AA" | 1141 | "N619AA" | "JFK" | "MIA" | 160.0 | 1089 | 5 | 40 | "2013-01-01T10:… |
2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | "B6" | 725 | "N804JB" | "JFK" | "BQN" | 183.0 | 1576 | 5 | 45 | "2013-01-01T10:… |
2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | "DL" | 461 | "N668DN" | "LGA" | "ATL" | 116.0 | 762 | 6 | 0 | "2013-01-01T11:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2013 | 1 | 1 | 2356.0 | 2359 | -3.0 | 425.0 | 437 | -12.0 | "B6" | 727 | "N588JB" | "JFK" | "BQN" | 186.0 | 1576 | 23 | 59 | "2013-01-02T04:… |
2013 | 1 | 1 | null | 1630 | null | null | 1815 | null | "EV" | 4308 | "N18120" | "EWR" | "RDU" | null | 416 | 16 | 30 | "2013-01-01T21:… |
2013 | 1 | 1 | null | 1935 | null | null | 2240 | null | "AA" | 791 | "N3EHAA" | "LGA" | "DFW" | null | 1389 | 19 | 35 | "2013-01-02T00:… |
2013 | 1 | 1 | null | 1500 | null | null | 1825 | null | "AA" | 1925 | "N3EVAA" | "LGA" | "MIA" | null | 1096 | 15 | 0 | "2013-01-01T20:… |
2013 | 1 | 1 | null | 600 | null | null | 901 | null | "B6" | 125 | "N618JB" | "JFK" | "FLL" | null | 1069 | 6 | 0 | "2013-01-01T11:… |
filter(
flights."month") == 1) | (pl.col("month") == 2)
(pl.col( )
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str |
2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | "UA" | 1545 | "N14228" | "EWR" | "IAH" | 227.0 | 1400 | 5 | 15 | "2013-01-01T10:… |
2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | "UA" | 1714 | "N24211" | "LGA" | "IAH" | 227.0 | 1416 | 5 | 29 | "2013-01-01T10:… |
2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | "AA" | 1141 | "N619AA" | "JFK" | "MIA" | 160.0 | 1089 | 5 | 40 | "2013-01-01T10:… |
2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | "B6" | 725 | "N804JB" | "JFK" | "BQN" | 183.0 | 1576 | 5 | 45 | "2013-01-01T10:… |
2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | "DL" | 461 | "N668DN" | "LGA" | "ATL" | 116.0 | 762 | 6 | 0 | "2013-01-01T11:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2013 | 2 | 28 | null | 850 | null | null | 1035 | null | "MQ" | 4558 | "N737MQ" | "LGA" | "CLE" | null | 419 | 8 | 50 | "2013-02-28T13:… |
2013 | 2 | 28 | null | 905 | null | null | 1115 | null | "MQ" | 4478 | "N722MQ" | "LGA" | "DTW" | null | 502 | 9 | 5 | "2013-02-28T14:… |
2013 | 2 | 28 | null | 1115 | null | null | 1310 | null | "MQ" | 4485 | "N725MQ" | "LGA" | "CMH" | null | 479 | 11 | 15 | "2013-02-28T16:… |
2013 | 2 | 28 | null | 830 | null | null | 1205 | null | "UA" | 1480 | null | "EWR" | "SFO" | null | 2565 | 8 | 30 | "2013-02-28T13:… |
2013 | 2 | 28 | null | 840 | null | null | 1147 | null | "UA" | 443 | null | "JFK" | "LAX" | null | 2475 | 8 | 40 | "2013-02-28T13:… |
filter(
flights."month").is_in([1, 2])
pl.col( )
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str |
2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | "UA" | 1545 | "N14228" | "EWR" | "IAH" | 227.0 | 1400 | 5 | 15 | "2013-01-01T10:… |
2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | "UA" | 1714 | "N24211" | "LGA" | "IAH" | 227.0 | 1416 | 5 | 29 | "2013-01-01T10:… |
2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | "AA" | 1141 | "N619AA" | "JFK" | "MIA" | 160.0 | 1089 | 5 | 40 | "2013-01-01T10:… |
2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | "B6" | 725 | "N804JB" | "JFK" | "BQN" | 183.0 | 1576 | 5 | 45 | "2013-01-01T10:… |
2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | "DL" | 461 | "N668DN" | "LGA" | "ATL" | 116.0 | 762 | 6 | 0 | "2013-01-01T11:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2013 | 2 | 28 | null | 850 | null | null | 1035 | null | "MQ" | 4558 | "N737MQ" | "LGA" | "CLE" | null | 419 | 8 | 50 | "2013-02-28T13:… |
2013 | 2 | 28 | null | 905 | null | null | 1115 | null | "MQ" | 4478 | "N722MQ" | "LGA" | "DTW" | null | 502 | 9 | 5 | "2013-02-28T14:… |
2013 | 2 | 28 | null | 1115 | null | null | 1310 | null | "MQ" | 4485 | "N725MQ" | "LGA" | "CMH" | null | 479 | 11 | 15 | "2013-02-28T16:… |
2013 | 2 | 28 | null | 830 | null | null | 1205 | null | "UA" | 1480 | null | "EWR" | "SFO" | null | 2565 | 8 | 30 | "2013-02-28T13:… |
2013 | 2 | 28 | null | 840 | null | null | 1147 | null | "UA" | 443 | null | "JFK" | "LAX" | null | 2475 | 8 | 40 | "2013-02-28T13:… |
4.2.3 arrange()
sort()
arrange()
Sometimes you don’t need a polars
expression, you can just give the column names as strings.
"year", "month", "day", "dep_time") flights.sort(
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str |
2013 | 1 | 1 | null | 1630 | null | null | 1815 | null | "EV" | 4308 | "N18120" | "EWR" | "RDU" | null | 416 | 16 | 30 | "2013-01-01T21:… |
2013 | 1 | 1 | null | 1935 | null | null | 2240 | null | "AA" | 791 | "N3EHAA" | "LGA" | "DFW" | null | 1389 | 19 | 35 | "2013-01-02T00:… |
2013 | 1 | 1 | null | 1500 | null | null | 1825 | null | "AA" | 1925 | "N3EVAA" | "LGA" | "MIA" | null | 1096 | 15 | 0 | "2013-01-01T20:… |
2013 | 1 | 1 | null | 600 | null | null | 901 | null | "B6" | 125 | "N618JB" | "JFK" | "FLL" | null | 1069 | 6 | 0 | "2013-01-01T11:… |
2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | "UA" | 1545 | "N14228" | "EWR" | "IAH" | 227.0 | 1400 | 5 | 15 | "2013-01-01T10:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2013 | 12 | 31 | 2321.0 | 2250 | 31.0 | 46.0 | 8 | 38.0 | "B6" | 2002 | "N179JB" | "JFK" | "BUF" | 66.0 | 301 | 22 | 50 | "2014-01-01T03:… |
2013 | 12 | 31 | 2328.0 | 2330 | -2.0 | 412.0 | 409 | 3.0 | "B6" | 1389 | "N651JB" | "EWR" | "SJU" | 198.0 | 1608 | 23 | 30 | "2014-01-01T04:… |
2013 | 12 | 31 | 2332.0 | 2245 | 47.0 | 58.0 | 3 | 55.0 | "B6" | 486 | "N334JB" | "JFK" | "ROC" | 60.0 | 264 | 22 | 45 | "2014-01-01T03:… |
2013 | 12 | 31 | 2355.0 | 2359 | -4.0 | 430.0 | 440 | -10.0 | "B6" | 1503 | "N509JB" | "JFK" | "SJU" | 195.0 | 1598 | 23 | 59 | "2014-01-01T04:… |
2013 | 12 | 31 | 2356.0 | 2359 | -3.0 | 436.0 | 445 | -9.0 | "B6" | 745 | "N665JB" | "JFK" | "PSE" | 200.0 | 1617 | 23 | 59 | "2014-01-01T04:… |
"dep_delay", descending=True) flights.sort(
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str |
2013 | 1 | 9 | 641.0 | 900 | 1301.0 | 1242.0 | 1530 | 1272.0 | "HA" | 51 | "N384HA" | "JFK" | "HNL" | 640.0 | 4983 | 9 | 0 | "2013-01-09T14:… |
2013 | 6 | 15 | 1432.0 | 1935 | 1137.0 | 1607.0 | 2120 | 1127.0 | "MQ" | 3535 | "N504MQ" | "JFK" | "CMH" | 74.0 | 483 | 19 | 35 | "2013-06-15T23:… |
2013 | 1 | 10 | 1121.0 | 1635 | 1126.0 | 1239.0 | 1810 | 1109.0 | "MQ" | 3695 | "N517MQ" | "EWR" | "ORD" | 111.0 | 719 | 16 | 35 | "2013-01-10T21:… |
2013 | 9 | 20 | 1139.0 | 1845 | 1014.0 | 1457.0 | 2210 | 1007.0 | "AA" | 177 | "N338AA" | "JFK" | "SFO" | 354.0 | 2586 | 18 | 45 | "2013-09-20T22:… |
2013 | 7 | 22 | 845.0 | 1600 | 1005.0 | 1044.0 | 1815 | 989.0 | "MQ" | 3075 | "N665MQ" | "JFK" | "CVG" | 96.0 | 589 | 16 | 0 | "2013-07-22T20:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2013 | 1 | 2 | null | 1540 | null | null | 1747 | null | "EV" | 4352 | "N10575" | "EWR" | "CVG" | null | 569 | 15 | 40 | "2013-01-02T20:… |
2013 | 1 | 1 | null | 600 | null | null | 901 | null | "B6" | 125 | "N618JB" | "JFK" | "FLL" | null | 1069 | 6 | 0 | "2013-01-01T11:… |
2013 | 1 | 1 | null | 1500 | null | null | 1825 | null | "AA" | 1925 | "N3EVAA" | "LGA" | "MIA" | null | 1096 | 15 | 0 | "2013-01-01T20:… |
2013 | 1 | 1 | null | 1935 | null | null | 2240 | null | "AA" | 791 | "N3EHAA" | "LGA" | "DFW" | null | 1389 | 19 | 35 | "2013-01-02T00:… |
2013 | 1 | 1 | null | 1630 | null | null | 1815 | null | "EV" | 4308 | "N18120" | "EWR" | "RDU" | null | 416 | 16 | 30 | "2013-01-01T21:… |
4.2.4 distinct()
unique()
distinct()
flights.unique()
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str |
2013 | 1 | 1 | 557.0 | 600 | -3.0 | 838.0 | 846 | -8.0 | "B6" | 79 | "N593JB" | "JFK" | "MCO" | 140.0 | 944 | 6 | 0 | "2013-01-01T11:… |
2013 | 1 | 1 | 558.0 | 600 | -2.0 | 924.0 | 917 | 7.0 | "UA" | 194 | "N29129" | "JFK" | "LAX" | 345.0 | 2475 | 6 | 0 | "2013-01-01T11:… |
2013 | 1 | 1 | 607.0 | 607 | 0.0 | 858.0 | 915 | -17.0 | "UA" | 1077 | "N53442" | "EWR" | "MIA" | 157.0 | 1085 | 6 | 7 | "2013-01-01T11:… |
2013 | 1 | 1 | 615.0 | 615 | 0.0 | 833.0 | 842 | -9.0 | "DL" | 575 | "N326NB" | "EWR" | "ATL" | 120.0 | 746 | 6 | 15 | "2013-01-01T11:… |
2013 | 1 | 1 | 628.0 | 630 | -2.0 | 1137.0 | 1140 | -3.0 | "AA" | 413 | "N3BAAA" | "JFK" | "SJU" | 192.0 | 1598 | 6 | 30 | "2013-01-01T11:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2013 | 9 | 30 | 2104.0 | 2019 | 45.0 | 2304.0 | 2240 | 24.0 | "EV" | 4333 | "N11109" | "EWR" | "TUL" | 154.0 | 1215 | 20 | 19 | "2013-10-01T00:… |
2013 | 9 | 30 | 2104.0 | 2110 | -6.0 | 2323.0 | 2341 | -18.0 | "EV" | 5811 | "N17560" | "EWR" | "JAX" | 112.0 | 820 | 21 | 10 | "2013-10-01T01:… |
2013 | 9 | 30 | 2207.0 | 2140 | 27.0 | 2257.0 | 2250 | 7.0 | "MQ" | 3660 | "N532MQ" | "LGA" | "BNA" | 97.0 | 764 | 21 | 40 | "2013-10-01T01:… |
2013 | 9 | 30 | 2240.0 | 2250 | -10.0 | 2347.0 | 7 | -20.0 | "B6" | 2002 | "N281JB" | "JFK" | "BUF" | 52.0 | 301 | 22 | 50 | "2013-10-01T02:… |
2013 | 9 | 30 | null | 840 | null | null | 1020 | null | "MQ" | 3531 | "N839MQ" | "LGA" | "RDU" | null | 431 | 8 | 40 | "2013-09-30T12:… |
The default for unique()
with target columns is equivalent to distinct(.keep_all = TRUE)
.
"origin", "dest"]) flights.unique([
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str |
2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | "DL" | 461 | "N668DN" | "LGA" | "ATL" | 116.0 | 762 | 6 | 0 | "2013-01-01T11:… |
2013 | 1 | 1 | 558.0 | 600 | -2.0 | 853.0 | 856 | -3.0 | "B6" | 71 | "N657JB" | "JFK" | "TPA" | 158.0 | 1005 | 6 | 0 | "2013-01-01T11:… |
2013 | 1 | 1 | 558.0 | 600 | -2.0 | 923.0 | 937 | -14.0 | "UA" | 1124 | "N53441" | "EWR" | "SFO" | 361.0 | 2565 | 6 | 0 | "2013-01-01T11:… |
2013 | 1 | 1 | 559.0 | 600 | -1.0 | 941.0 | 910 | 31.0 | "AA" | 707 | "N3DUAA" | "LGA" | "DFW" | 257.0 | 1389 | 6 | 0 | "2013-01-01T11:… |
2013 | 1 | 1 | 559.0 | 559 | 0.0 | 702.0 | 706 | -4.0 | "B6" | 1806 | "N708JB" | "JFK" | "BOS" | 44.0 | 187 | 5 | 59 | "2013-01-01T10:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2013 | 1 | 5 | 1620.0 | 1625 | -5.0 | 1749.0 | 1819 | -30.0 | "EV" | 5085 | "N611QX" | "LGA" | "GSO" | 71.0 | 461 | 16 | 25 | "2013-01-05T21:… |
2013 | 10 | 1 | 916.0 | 925 | -9.0 | 1155.0 | 1207 | -12.0 | "9E" | 4065 | "N8903A" | "LGA" | "SDF" | 95.0 | 659 | 9 | 25 | "2013-10-01T13:… |
2013 | 10 | 1 | 955.0 | 1000 | -5.0 | 1156.0 | 1234 | -38.0 | "9E" | 3574 | "N913XJ" | "LGA" | "IND" | 93.0 | 660 | 10 | 0 | "2013-10-01T14:… |
2013 | 10 | 3 | 1414.0 | 1350 | 24.0 | 1514.0 | 1453 | 21.0 | "B6" | 1338 | "N368JB" | "JFK" | "MVY" | 36.0 | 173 | 13 | 50 | "2013-10-03T17:… |
2013 | 12 | 1 | 1241.0 | 1215 | 26.0 | 1431.0 | 1431 | 0.0 | "EV" | 3264 | "N16976" | "EWR" | "SBN" | 94.0 | 637 | 12 | 15 | "2013-12-01T17:… |
Note though that by default unique()
will choose any of the duplicate rows whereas distinct()
returns the first of the duplicates.
You can tell unique to choose the first, but according to the documentation it’s less performant.
"origin", "dest"], keep="first") flights.unique([
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str |
2013 | 1 | 1 | 558.0 | 600 | -2.0 | 924.0 | 917 | 7.0 | "UA" | 194 | "N29129" | "JFK" | "LAX" | 345.0 | 2475 | 6 | 0 | "2013-01-01T11:… |
2013 | 1 | 1 | 602.0 | 610 | -8.0 | 812.0 | 820 | -8.0 | "DL" | 1919 | "N971DL" | "LGA" | "MSP" | 170.0 | 1020 | 6 | 10 | "2013-01-01T11:… |
2013 | 1 | 1 | 606.0 | 610 | -4.0 | 858.0 | 910 | -12.0 | "AA" | 1895 | "N633AA" | "EWR" | "MIA" | 152.0 | 1085 | 6 | 10 | "2013-01-01T11:… |
2013 | 1 | 1 | 629.0 | 630 | -1.0 | 824.0 | 833 | -9.0 | "US" | 1019 | "N426US" | "EWR" | "CLT" | 91.0 | 529 | 6 | 30 | "2013-01-01T11:… |
2013 | 1 | 1 | 652.0 | 655 | -3.0 | 932.0 | 921 | 11.0 | "B6" | 117 | "N178JB" | "JFK" | "MSY" | 191.0 | 1182 | 6 | 55 | "2013-01-01T11:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2013 | 10 | 1 | 1543.0 | 1545 | -2.0 | 1702.0 | 1721 | -19.0 | "EV" | 5293 | "N398CA" | "LGA" | "ORF" | 46.0 | 296 | 15 | 45 | "2013-10-01T19:… |
2013 | 10 | 1 | 1658.0 | 1700 | -2.0 | 2002.0 | 1955 | 7.0 | "AA" | 211 | "N3FVAA" | "JFK" | "IAH" | 214.0 | 1417 | 17 | 0 | "2013-10-01T21:… |
2013 | 10 | 3 | 1414.0 | 1350 | 24.0 | 1514.0 | 1453 | 21.0 | "B6" | 1338 | "N368JB" | "JFK" | "MVY" | 36.0 | 173 | 13 | 50 | "2013-10-03T17:… |
2013 | 12 | 19 | 1056.0 | 1103 | -7.0 | 1210.0 | 1219 | -9.0 | "EV" | 5252 | "N848AS" | "LGA" | "MHT" | 39.0 | 195 | 11 | 3 | "2013-12-19T16:… |
2013 | 7 | 5 | 1355.0 | 1400 | -5.0 | 1550.0 | 1620 | -30.0 | "EV" | 4137 | "N11164" | "EWR" | "TVC" | 93.0 | 644 | 14 | 0 | "2013-07-05T18:… |
To replicate the default distinct()
behavior, you select()
those columns first.
"origin", "dest").unique() flights.select(
origin | dest |
---|---|
str | str |
"LGA" | "ORD" |
"EWR" | "PBI" |
"LGA" | "MSP" |
"JFK" | "RSW" |
"JFK" | "FLL" |
… | … |
"LGA" | "HOU" |
"LGA" | "MSN" |
"JFK" | "MCI" |
"LGA" | "SBN" |
"LGA" | "CHO" |
As far as I could tell, there isn’t a convenient count()
equivalent in polars
. It’s not too bad to replicate though.
(
flights"origin", "dest")
.group_by(
.count()'count', descending=True)
.sort( )
origin | dest | count |
---|---|---|
str | str | u32 |
"JFK" | "LAX" | 11262 |
"LGA" | "ATL" | 10263 |
"LGA" | "ORD" | 8857 |
"JFK" | "SFO" | 8204 |
"LGA" | "CLT" | 6168 |
… | … | … |
"JFK" | "STL" | 1 |
"JFK" | "BHM" | 1 |
"JFK" | "MEM" | 1 |
"LGA" | "LEX" | 1 |
"EWR" | "LGA" | 1 |
4.3.1 mutate()
with_columns()
mutate()
Again we have to be a bit more verbose in polars
. At least we don’t need to repeat the DataFrame name over and over!
flights.with_columns(=pl.col("dep_delay") - pl.col("arr_delay"),
gain=pl.col("distance") / pl.col("air_time") * 60
speed )
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | gain | speed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str | f64 | f64 |
2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | "UA" | 1545 | "N14228" | "EWR" | "IAH" | 227.0 | 1400 | 5 | 15 | "2013-01-01T10:… | -9.0 | 370.044053 |
2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | "UA" | 1714 | "N24211" | "LGA" | "IAH" | 227.0 | 1416 | 5 | 29 | "2013-01-01T10:… | -16.0 | 374.273128 |
2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | "AA" | 1141 | "N619AA" | "JFK" | "MIA" | 160.0 | 1089 | 5 | 40 | "2013-01-01T10:… | -31.0 | 408.375 |
2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | "B6" | 725 | "N804JB" | "JFK" | "BQN" | 183.0 | 1576 | 5 | 45 | "2013-01-01T10:… | 17.0 | 516.721311 |
2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | "DL" | 461 | "N668DN" | "LGA" | "ATL" | 116.0 | 762 | 6 | 0 | "2013-01-01T11:… | 19.0 | 394.137931 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2013 | 9 | 30 | null | 1455 | null | null | 1634 | null | "9E" | 3393 | null | "JFK" | "DCA" | null | 213 | 14 | 55 | "2013-09-30T18:… | null | null |
2013 | 9 | 30 | null | 2200 | null | null | 2312 | null | "9E" | 3525 | null | "LGA" | "SYR" | null | 198 | 22 | 0 | "2013-10-01T02:… | null | null |
2013 | 9 | 30 | null | 1210 | null | null | 1330 | null | "MQ" | 3461 | "N535MQ" | "LGA" | "BNA" | null | 764 | 12 | 10 | "2013-09-30T16:… | null | null |
2013 | 9 | 30 | null | 1159 | null | null | 1344 | null | "MQ" | 3572 | "N511MQ" | "LGA" | "CLE" | null | 419 | 11 | 59 | "2013-09-30T15:… | null | null |
2013 | 9 | 30 | null | 840 | null | null | 1020 | null | "MQ" | 3531 | "N839MQ" | "LGA" | "RDU" | null | 431 | 8 | 40 | "2013-09-30T12:… | null | null |
From what I can tell, there isn’t any .before
or .after
equivalents in polars
. We can approximate the behavior using .select()
, which behaves similar to transmute()
in dplyr
. We also have to switch to using the .alias()
method for naming our new columns so that we can tack on all the original columns to the end.
flights.select("dep_delay") - pl.col("arr_delay")).alias("gain"),
(pl.col("distance") / pl.col("air_time") * 60).alias("speed"),
(pl.col("*")
pl.col( )
gain | speed | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
f64 | f64 | i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str |
-9.0 | 370.044053 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | "UA" | 1545 | "N14228" | "EWR" | "IAH" | 227.0 | 1400 | 5 | 15 | "2013-01-01T10:… |
-16.0 | 374.273128 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | "UA" | 1714 | "N24211" | "LGA" | "IAH" | 227.0 | 1416 | 5 | 29 | "2013-01-01T10:… |
-31.0 | 408.375 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | "AA" | 1141 | "N619AA" | "JFK" | "MIA" | 160.0 | 1089 | 5 | 40 | "2013-01-01T10:… |
17.0 | 516.721311 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | "B6" | 725 | "N804JB" | "JFK" | "BQN" | 183.0 | 1576 | 5 | 45 | "2013-01-01T10:… |
19.0 | 394.137931 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | "DL" | 461 | "N668DN" | "LGA" | "ATL" | 116.0 | 762 | 6 | 0 | "2013-01-01T11:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
null | null | 2013 | 9 | 30 | null | 1455 | null | null | 1634 | null | "9E" | 3393 | null | "JFK" | "DCA" | null | 213 | 14 | 55 | "2013-09-30T18:… |
null | null | 2013 | 9 | 30 | null | 2200 | null | null | 2312 | null | "9E" | 3525 | null | "LGA" | "SYR" | null | 198 | 22 | 0 | "2013-10-01T02:… |
null | null | 2013 | 9 | 30 | null | 1210 | null | null | 1330 | null | "MQ" | 3461 | "N535MQ" | "LGA" | "BNA" | null | 764 | 12 | 10 | "2013-09-30T16:… |
null | null | 2013 | 9 | 30 | null | 1159 | null | null | 1344 | null | "MQ" | 3572 | "N511MQ" | "LGA" | "CLE" | null | 419 | 11 | 59 | "2013-09-30T15:… |
null | null | 2013 | 9 | 30 | null | 840 | null | null | 1020 | null | "MQ" | 3531 | "N839MQ" | "LGA" | "RDU" | null | 431 | 8 | 40 | "2013-09-30T12:… |
The above works well for putting new columns at the start or end of a DataFrame. Doesn’t seem like there’s an easy way to place them next to a desired column.
4.3.2 select()
Another shared name here, with similar functionality!
"year", "month", "day") flights.select(
year | month | day |
---|---|---|
i64 | i64 | i64 |
2013 | 1 | 1 |
2013 | 1 | 1 |
2013 | 1 | 1 |
2013 | 1 | 1 |
2013 | 1 | 1 |
… | … | … |
2013 | 9 | 30 |
2013 | 9 | 30 |
2013 | 9 | 30 |
2013 | 9 | 30 |
2013 | 9 | 30 |
I couldn’t find a simple way of selecting a range of columns based on their name. It’s split up into two parts here.
= flights.columns[flights.find_idx_by_name("year"):flights.find_idx_by_name("day") + 1]
tcols flights.select(tcols)
year | month | day |
---|---|---|
i64 | i64 | i64 |
2013 | 1 | 1 |
2013 | 1 | 1 |
2013 | 1 | 1 |
2013 | 1 | 1 |
2013 | 1 | 1 |
… | … | … |
2013 | 9 | 30 |
2013 | 9 | 30 |
2013 | 9 | 30 |
2013 | 9 | 30 |
2013 | 9 | 30 |
To select everything but the range of of columns we invoke a selector.
~cs.by_name(tcols)) flights.select(
dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str |
517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | "UA" | 1545 | "N14228" | "EWR" | "IAH" | 227.0 | 1400 | 5 | 15 | "2013-01-01T10:… |
533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | "UA" | 1714 | "N24211" | "LGA" | "IAH" | 227.0 | 1416 | 5 | 29 | "2013-01-01T10:… |
542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | "AA" | 1141 | "N619AA" | "JFK" | "MIA" | 160.0 | 1089 | 5 | 40 | "2013-01-01T10:… |
544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | "B6" | 725 | "N804JB" | "JFK" | "BQN" | 183.0 | 1576 | 5 | 45 | "2013-01-01T10:… |
554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | "DL" | 461 | "N668DN" | "LGA" | "ATL" | 116.0 | 762 | 6 | 0 | "2013-01-01T11:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
null | 1455 | null | null | 1634 | null | "9E" | 3393 | null | "JFK" | "DCA" | null | 213 | 14 | 55 | "2013-09-30T18:… |
null | 2200 | null | null | 2312 | null | "9E" | 3525 | null | "LGA" | "SYR" | null | 198 | 22 | 0 | "2013-10-01T02:… |
null | 1210 | null | null | 1330 | null | "MQ" | 3461 | "N535MQ" | "LGA" | "BNA" | null | 764 | 12 | 10 | "2013-09-30T16:… |
null | 1159 | null | null | 1344 | null | "MQ" | 3572 | "N511MQ" | "LGA" | "CLE" | null | 419 | 11 | 59 | "2013-09-30T15:… |
null | 840 | null | null | 1020 | null | "MQ" | 3531 | "N839MQ" | "LGA" | "RDU" | null | 431 | 8 | 40 | "2013-09-30T12:… |
# same as using `where(is.character)` flights.select(cs.string())
carrier | tailnum | origin | dest | time_hour |
---|---|---|---|---|
str | str | str | str | str |
"UA" | "N14228" | "EWR" | "IAH" | "2013-01-01T10:… |
"UA" | "N24211" | "LGA" | "IAH" | "2013-01-01T10:… |
"AA" | "N619AA" | "JFK" | "MIA" | "2013-01-01T10:… |
"B6" | "N804JB" | "JFK" | "BQN" | "2013-01-01T10:… |
"DL" | "N668DN" | "LGA" | "ATL" | "2013-01-01T11:… |
… | … | … | … | … |
"9E" | null | "JFK" | "DCA" | "2013-09-30T18:… |
"9E" | null | "LGA" | "SYR" | "2013-10-01T02:… |
"MQ" | "N535MQ" | "LGA" | "BNA" | "2013-09-30T16:… |
"MQ" | "N511MQ" | "LGA" | "CLE" | "2013-09-30T15:… |
"MQ" | "N839MQ" | "LGA" | "RDU" | "2013-09-30T12:… |
Many of the selection helpers are the same in polars
:
- cs.starts_with(“abc”)
- cs.ends_with(“xyz”)
- cs.contains(“ijk”)
No equivalent to num_range()
that I could find, however.
Renaming and selection at the same time is possible though.
= "tailnum") flights.select(tail_num
tail_num |
---|
str |
"N14228" |
"N24211" |
"N619AA" |
"N804JB" |
"N668DN" |
… |
null |
null |
"N535MQ" |
"N511MQ" |
"N839MQ" |
4.3.3 rename()
rename()
in polars
takes a dictionary.
"tailnum": "tail_num"}) flights.rename({
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tail_num | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | i64 | i64 | i64 | str |
2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | "UA" | 1545 | "N14228" | "EWR" | "IAH" | 227.0 | 1400 | 5 | 15 | "2013-01-01T10:… |
2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | "UA" | 1714 | "N24211" | "LGA" | "IAH" | 227.0 | 1416 | 5 | 29 | "2013-01-01T10:… |
2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | "AA" | 1141 | "N619AA" | "JFK" | "MIA" | 160.0 | 1089 | 5 | 40 | "2013-01-01T10:… |
2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | "B6" | 725 | "N804JB" | "JFK" | "BQN" | 183.0 | 1576 | 5 | 45 | "2013-01-01T10:… |
2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | "DL" | 461 | "N668DN" | "LGA" | "ATL" | 116.0 | 762 | 6 | 0 | "2013-01-01T11:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2013 | 9 | 30 | null | 1455 | null | null | 1634 | null | "9E" | 3393 | null | "JFK" | "DCA" | null | 213 | 14 | 55 | "2013-09-30T18:… |
2013 | 9 | 30 | null | 2200 | null | null | 2312 | null | "9E" | 3525 | null | "LGA" | "SYR" | null | 198 | 22 | 0 | "2013-10-01T02:… |
2013 | 9 | 30 | null | 1210 | null | null | 1330 | null | "MQ" | 3461 | "N535MQ" | "LGA" | "BNA" | null | 764 | 12 | 10 | "2013-09-30T16:… |
2013 | 9 | 30 | null | 1159 | null | null | 1344 | null | "MQ" | 3572 | "N511MQ" | "LGA" | "CLE" | null | 419 | 11 | 59 | "2013-09-30T15:… |
2013 | 9 | 30 | null | 840 | null | null | 1020 | null | "MQ" | 3531 | "N839MQ" | "LGA" | "RDU" | null | 431 | 8 | 40 | "2013-09-30T12:… |
4.3.4 relocate()
???
relocate()
I can’t find a relocate()
equivalent in polars
. I think you’d have to get clever with .select()
and pl.all()
to make it happen.
4.5.1 group_by()
The workhorse.
"month") flights.group_by(
<polars.dataframe.group_by.GroupBy at 0x105d2f790>
4.5.2 summarize()
agg()
summarize()
A thing to note about polars
is most of the aggregating functions ignore missing data by default. Coming from R, you’d expect NA
results unless you explicitly ignore NA
s.
(
flights"month")
.group_by(
.agg(= pl.mean("dep_delay") # Ignores missing by default
avg_delay
)"month") # To compare to book
.sort( )
month | avg_delay |
---|---|
i64 | f64 |
1 | 10.036665 |
2 | 10.816843 |
3 | 13.227076 |
4 | 13.938038 |
5 | 12.986859 |
… | … |
8 | 12.61104 |
9 | 6.722476 |
10 | 6.243988 |
11 | 5.435362 |
12 | 16.576688 |
(
flights"month")
.group_by(
.agg(= pl.mean("dep_delay"),
avg_delay = pl.count()
n
)"month") # To compare to book.
.sort( )
month | avg_delay | n |
---|---|---|
i64 | f64 | u32 |
1 | 10.036665 | 27004 |
2 | 10.816843 | 24951 |
3 | 13.227076 | 28834 |
4 | 13.938038 | 28330 |
5 | 12.986859 | 28796 |
… | … | … |
8 | 12.61104 | 29327 |
9 | 6.722476 | 27574 |
10 | 6.243988 | 28889 |
11 | 5.435362 | 27268 |
12 | 16.576688 | 28135 |
4.5.3 The slice_
window
functions
slice_
This section is the largest departure so far from dplyr
. In polars
we use window functions (like SQL) to perform the type of computations implemented via slice_
.
(
flights filter(
."arr_delay") == pl.col("arr_delay").max().over("dest")
pl.col(
)
.select("dest"),
pl.col("*").exclude("dest") # Example of getting around a lack of `relocate()`.
pl.col(
)"dest") # For book comparison
.sort( )
dest | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | i64 | i64 | i64 | f64 | i64 | f64 | f64 | i64 | f64 | str | i64 | str | str | f64 | i64 | i64 | i64 | str |
"ABQ" | 2013 | 7 | 22 | 2145.0 | 2007 | 98.0 | 132.0 | 2259 | 153.0 | "B6" | 1505 | "N784JB" | "JFK" | 259.0 | 1826 | 20 | 7 | "2013-07-23T00:… |
"ACK" | 2013 | 7 | 23 | 1139.0 | 800 | 219.0 | 1250.0 | 909 | 221.0 | "B6" | 1491 | "N192JB" | "JFK" | 44.0 | 199 | 8 | 0 | "2013-07-23T12:… |
"ALB" | 2013 | 1 | 25 | 123.0 | 2000 | 323.0 | 229.0 | 2101 | 328.0 | "EV" | 4309 | "N13908" | "EWR" | 30.0 | 143 | 20 | 0 | "2013-01-26T01:… |
"ANC" | 2013 | 8 | 17 | 1740.0 | 1625 | 75.0 | 2042.0 | 2003 | 39.0 | "UA" | 887 | "N528UA" | "EWR" | 404.0 | 3370 | 16 | 25 | "2013-08-17T20:… |
"ATL" | 2013 | 7 | 22 | 2257.0 | 759 | 898.0 | 121.0 | 1026 | 895.0 | "DL" | 2047 | "N6716C" | "LGA" | 109.0 | 762 | 7 | 59 | "2013-07-22T11:… |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
"TPA" | 2013 | 4 | 10 | 1100.0 | 1900 | 960.0 | 1342.0 | 2211 | 931.0 | "DL" | 2391 | "N959DL" | "JFK" | 139.0 | 1005 | 19 | 0 | "2013-04-10T23:… |
"TUL" | 2013 | 1 | 20 | 2347.0 | 1936 | 251.0 | 230.0 | 2208 | 262.0 | "EV" | 4333 | "N14177" | "EWR" | 208.0 | 1215 | 19 | 36 | "2013-01-21T00:… |
"TVC" | 2013 | 8 | 10 | 1120.0 | 730 | 230.0 | 1323.0 | 943 | 220.0 | "EV" | 5144 | "N718EV" | "LGA" | 96.0 | 655 | 7 | 30 | "2013-08-10T11:… |
"TYS" | 2013 | 4 | 12 | 51.0 | 2000 | 291.0 | 250.0 | 2209 | 281.0 | "EV" | 3826 | "N15986" | "EWR" | 103.0 | 631 | 20 | 0 | "2013-04-13T00:… |
"XNA" | 2013 | 3 | 8 | 1311.0 | 822 | 289.0 | 1615.0 | 1045 | 330.0 | "EV" | 4140 | "N18556" | "EWR" | 168.0 | 1131 | 8 | 22 | "2013-03-08T13:… |
4.5.4 Grouping by multiple variables
= flights.group_by("year", "month", "day") daily
daily.count()
year | month | day | count |
---|---|---|---|
i64 | i64 | i64 | u32 |
2013 | 1 | 7 | 933 |
2013 | 1 | 14 | 928 |
2013 | 1 | 27 | 823 |
2013 | 10 | 10 | 994 |
2013 | 10 | 17 | 995 |
… | … | … | … |
2013 | 9 | 17 | 961 |
2013 | 9 | 20 | 994 |
2013 | 9 | 23 | 993 |
2013 | 9 | 26 | 996 |
2013 | 9 | 29 | 914 |
4.5.5 Ungrouping
No takebacks on grouping in polars
. However, groupings do not persist after aggregation.
type(daily)
polars.dataframe.group_by.GroupBy
type(daily.count())
polars.dataframe.frame.DataFrame
4.5.6 .by
Nothing to do here.
4.6 Case study: aggregates and sample size
For fun, let’s replicate this code as well.
from teqniqly.lahman_datasets import LahmanDatasets
= LahmanDatasets()
ld
ld.load()
= next(filter(lambda x: x.endswith("Batting"), ld.dataframe_names)) batting_loc
= (
batters
pl.from_pandas(ld[batting_loc])"playerID")
.group_by(
.agg(= pl.col("H").sum() / pl.col("AB").sum(),
performance = pl.col("AB").sum()
n
)
)
batters
playerID | performance | n |
---|---|---|
str | f64 | i64 |
"martijd01" | 0.162162 | 37 |
"peterja01" | 0.230807 | 2019 |
"clintlo01" | 0.247097 | 2153 |
"smithca02" | NaN | 0 |
"dumovni01" | 0.241379 | 29 |
… | … | … |
"wardja01" | 0.163265 | 49 |
"clairda01" | 0.142857 | 7 |
"ogdenja01" | 0.2 | 170 |
"bilkost01" | 0.248562 | 1738 |
"odayha01" | 0.189718 | 817 |
Plotting isn’t really the focus of this project, but let’s try replicating this simple chart. This also allows us to try out the .pipe()
method of a polars
DataFrame.
import altair as alt
"vegafusion")
alt.data_transformers.enable(
= (
chart
battersfilter(pl.col("n") > 100)
.
.pipe(alt.Chart)="n", y="performance")
.encode(x="black", opacity=1 / 10)
.mark_point(color='container')
.properties(width
)
+ chart.transform_loess('n', 'performance').mark_line() chart