Linklog
This is my Linklog, where I keep links to my favorite content on the Internet. It's a great way to keep this content for reference while also sharing articles and tools I come across and find interesting.
March 2026
- What Category Theory Teaches Us About DataFrames (mchav.github.io)
- #data
An interesting perspective on the abstraction behind the dataframe algebra.
- When Data Lies: Finding Optimal Strategies for Penalty Kicks with Game Theory (towardsdatascience.com)
- #data
This article highlights a very interesting aspect of analysing historical data. In this case, center penalty kicks appear surprisingly successful. The article shows that one shouldn't blindly assume this is because they are inherently more effective (they aren't). Rather, goalkeepers act suboptimally, and the player-goalkeeper interaction is responsible for this overrepresentation.
February 2026
- Ten years late to the dbt party (DuckDB edition) (rmoff.net)
- #data #data-engineering
If, like me, you feel late to the dbt party, you're not alone!
May 2025
- Are you more likely to die on your birthday? (pudding.cool)
- #data #statistics
A fun analysis of the birthday effect using actual data and thorough methodology.
- dataframely — A declarative, 🐻❄️-native data frame validation library (tech.quantco.com)
- #data #library
I've been working a lot on our data pipelines at work, switching to polars mostly for performance and introducing rigorous checks and validations of data at various stages. I haven't yet used dataframely, but its principle really resonates with my use case, so I recommend checking it out.
March 2025
- Succinct data structures (blog.startifact.com)
- #data
Succinct data structures are clever ways to pack a lot of information in lightweight structures like bit vectors. A very interesting read!
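To give a flavor of the idea, here is a minimal sketch of a bit vector that answers rank queries (how many 1 bits occur before a given position), one classic operation in this family. The class name `RankBitVector` and the block size are my own assumptions for illustration. A real succinct implementation would pack the bits into machine words rather than a Python list, so this shows the principle, not the space savings.

```python
class RankBitVector:
    """Bit vector with precomputed per-block counts to answer rank queries."""

    BLOCK = 64  # hypothetical block size, chosen for illustration

    def __init__(self, bits):
        self.bits = list(bits)
        # block_counts[k] = number of 1 bits in bits[0 : k * BLOCK]
        self.block_counts = [0]
        for start in range(0, len(self.bits), self.BLOCK):
            ones = sum(self.bits[start:start + self.BLOCK])
            self.block_counts.append(self.block_counts[-1] + ones)

    def rank1(self, i):
        """Count the 1 bits in positions [0, i)."""
        block, offset = divmod(i, self.BLOCK)
        start = block * self.BLOCK
        # Jump to the block with one lookup, then scan at most BLOCK bits.
        return self.block_counts[block] + sum(self.bits[start:start + offset])


bits = [1, 0, 1, 1, 0] * 40  # 200 bits of toy data
vector = RankBitVector(bits)
print(vector.rank1(5))
```

The small table of block counts is what lets a query skip most of the vector instead of scanning it from the start.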
February 2025
- Binary vector embeddings are so cool (emschwartz.me)
- #llm #deep-learning #data
A description of the effect of binary quantization on embeddings. By lowering the precision of embedding vectors, you trade accuracy in latent space for a smaller embedding. Binary quantization seems to preserve a surprisingly large share of the original information (about 97%) while also saving a huge amount of space (about 97% as well).
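As a rough illustration of the mechanism, here is a minimal sketch of binary quantization in plain Python. It keeps only the sign of each component and compares vectors by Hamming distance. The vectors are random stand-ins rather than real model output, and the function names, the 1024 dimensions, and the float32 baseline are my assumptions, not details taken from the article.

```python
import random


def binarize(vec):
    """Pack the sign of each component into the bits of one Python int."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of bit positions where two packed vectors differ."""
    return bin(a ^ b).count("1")


# Hypothetical embeddings: random stand-ins, not real model output.
rng = random.Random(0)
dim = 1024
doc = [rng.gauss(0, 1) for _ in range(dim)]
near = [x + rng.gauss(0, 0.3) for x in doc]  # small perturbation of doc
far = [rng.gauss(0, 1) for _ in range(dim)]  # unrelated vector

d_near = hamming(binarize(doc), binarize(near))
d_far = hamming(binarize(doc), binarize(far))

# Storage: 4 bytes per float32 component versus 1 bit per component.
float32_bytes = dim * 4
binary_bytes = dim // 8
saving = 1 - binary_bytes / float32_bytes

print(d_near, d_far, saving)
```

Under these assumptions, going from 32 bits to 1 bit per component is a 32x reduction, which is where a space saving of roughly 97% comes from. The perturbed vector should stay much closer to the original than the unrelated one, even after throwing away everything but the signs.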
January 2025
- Data Contracts as Therapy (benrutter.github.io)
- #data
Musings about the use of data contracts to validate data sources. If you've ever been frustrated by a data source suddenly changing its schema or sending unexpected data, this is for you!
- Polars for initial data analysis, Polars for production (pythonspeed.com)
- #python #data
Article about the use of Polars for both production and development stages. When starting with Polars, I found it easy to write production code (usually a long pipeline of LazyFrames ending with a collect), but struggled with writing optimal development code.
- Modern Polars (kevinheavey.github.io)
- #python #data
Great online book about Polars aimed at Pandas users. If you haven't heard about Polars yet, do yourself a favor and read this.
October 2024
- First aid for figures: all resources (helenajamborwrites.netlify.app)
- #data
A collection of resources to help make better data visualizations. Definitely useful as a refresher or reference before making a report or a presentation.
September 2024
- Was Michael Scott the World’s Best Boss? (datacream.substack.com)
- #data
I always love it when data scientists take their hobbies too far. This is a cool example of data science applied to "The Office", using sentiment analysis to figure out whether Michael Scott was truly appreciated.
August 2024
- Column Names as Contracts (emilyriederer.netlify.app)
- #best-practices #data
An interesting explanation of implicit data contracts through naming conventions.