Larger Than Memory Data Workflows with Apache Arrow and Ibis

As datasets grow larger and more complex, the boundary between data engineering and data science is blurring. Analysis pipelines over larger-than-memory data are now commonplace, creating a gap that needs to be bridged: between engineering tools designed to handle very large datasets on the one hand, and data science tools that provide the analysis capabilities used in data workflows on the other. One way to build this bridge is with Apache Arrow, a multi-language toolbox for working with larger-than-memory tabular data. Arrow is designed for performance and efficiency, and places emphasis on standardization and interoperability among workflow components, programming languages, and systems. We'll combine Arrow's ability to store data in a compact, efficient form with Ibis for data analysis in Python. Ibis is a pandas-like Python interface that lets you do robust analytics on data of any size by delegating your queries to a range of powerful database engines. In this workshop we'll work with a very large dataset and draw insights from it, all from our local machines.
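To give a flavour of that Ibis pattern, here is a minimal sketch, assuming a recent Ibis release with the bundled DuckDB backend; the file and column names are hypothetical:

```python
import ibis

# Ibis expressions are backend-agnostic: the same pandas-like code can run
# on DuckDB locally or on a larger engine by swapping the connect() call.
con = ibis.duckdb.connect()           # in-memory DuckDB database

# Reference a Parquet file lazily; nothing is pulled into Python memory yet.
t = con.read_parquet("data.parquet")  # hypothetical file

# Build an expression, then hand it to the engine; only .execute() runs it.
print(t.filter(t.year == 2021).count().execute())
```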

Marlene Mhangami

@marlene_zw
Developer Advocate

Marlene is a Zimbabwean software engineer, developer advocate and explorer. She is a former director and vice-chair of the Python Software Foundation and currently serves as vice-chair of the Association for Computing Machinery's Practitioner Board. In 2017, she co-founded ZimboPy, a non-profit organization that gives young Zimbabwean women access to resources in the field of technology. She is also a former chair of PyCon Africa and an advocate for women in tech on the continent. Professionally, Marlene currently works as a Developer Advocate at Voltron Data.

What the attendees will learn

- How to access large CSV files and convert them to Parquet using Apache Arrow (we'll specifically be using the PUMS census dataset) 

- How converting CSV files to Parquet compresses them efficiently, letting you work with them on your local machine (sketched in code after this list) 

- How to analyse larger-than-memory Parquet files using Ibis 

- How to efficiently draw interesting insights from the PUMS census dataset
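
As a taste of the conversion step, here is a minimal sketch using PyArrow's streaming CSV reader, so the whole file never has to fit in memory; the file names are hypothetical stand-ins for the PUMS CSVs:

```python
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Open the CSV as a stream of record batches rather than one big table,
# so a larger-than-memory file can be converted batch by batch.
reader = pacsv.open_csv("psam_pusa.csv")  # hypothetical PUMS file name

# Write each batch straight to Parquet; Parquet's columnar layout and
# built-in compression typically shrink the file considerably.
with pq.ParquetWriter("psam_pusa.parquet", reader.schema) as writer:
    for batch in reader:
        writer.write_batch(batch)
```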

Requirements

A laptop with your favourite IDE.

Companies that use this technology

ClickHouse, KNIME, Dremio, InfluxDB IOx, OmniSci, TileDB, Falcon

Workshop Plan

- Access data from the PUMS dataset (we're working on potentially letting attendees access it through a shared JupyterLab so they don't need to download it) 

- Convert the dataset to Parquet 

- Clean up schema 

- Split the data into people and housing records

- Convert these into two Parquet files and compare their size with the original 

- Read in the Parquet files with Ibis (see the sketch after this list) 

- Analyse the PUMS data, drawing insights from it

- Done! 
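
Putting the last steps together, here is a minimal sketch of the Ibis side, assuming a recent Ibis release with the DuckDB backend; the file name and the PUMS column codes ("ST" for state, "AGEP" for age) are assumptions about the cleaned-up schema:

```python
import ibis

# DuckDB scans the Parquet file lazily instead of loading it whole.
con = ibis.duckdb.connect()
people = con.read_parquet("people.parquet")  # hypothetical output file

# Average age per state, highest first; the engine does the heavy lifting,
# and .execute() brings back only the small result as a pandas DataFrame.
expr = (
    people.group_by("ST")
    .aggregate(mean_age=people.AGEP.mean())
    .order_by(ibis.desc("mean_age"))
)
print(expr.execute())
```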

Larger Than Memory Data Workflows with Apache Arrow and Ibis

Date and time:

10th, 11:00 - 13:00

Topics:

Data Engineering, Data Science

Target audience roles:

Data Engineers, Data Scientists

Attendees:

30

Included:

Self-Service Coffee
(This workshop is free for general ticket holders while spots last)
 
Sold out