Convert Ragged Nested Sequences to MultiIndex Pandas DataFrame

Posted Aug 31, 2023 Updated Sep 1, 2023

By Ruthran Chandrasekar

What’s a Ragged Nested Sequence?

NumPy describes a ragged nested sequence as a list-or-tuple of lists-or-tuples-or-ndarrays with different lengths or shapes.

Essentially it’s an array of arrays where the sub-arrays are of different lengths.

  
[
    [
        ['high', 'med', '3', '2', 'med', 'low'],
        ['med', 'low', '5more', '2', 'big', 'med'],
        ['vhigh', 'vhigh', '2', '2', 'med', 'low'],
        ['high', 'med', '4', '2', 'big', 'low']
    ],
    [
        ['med', 'low', '3', '4', 'med', 'med'],
        ['med', 'low', '4', '4', 'med', 'low'],
        ['low', 'low', '2', '4', 'big', 'med']
    ],
    [
        ['med', 'vhigh', '4', 'more', 'small', 'high'],
        ['med', 'med', '2', 'more', 'big', 'high'],
        ['med', 'med', '2', 'more', 'med', 'med']
    ]
]

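This example comes from the cars dataset used in a decision tree tutorial. The three outer lists (the splits) hold 4, 3, and 3 rows respectively, which is what makes the structure ragged.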

Why convert it into a DataFrame?

The data is difficult to work with in nested-list form.
NumPy cannot convert the sequence to a regular ndarray directly because the sub-lists have unequal lengths.
The printed nested list is hard to read and inspect.
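
The second point can be checked directly. The snippet below is a minimal sketch, not part of the original post: it uses a shortened two-column version of the data, and `try_regular_array` is a hypothetical helper name. Recent NumPy versions raise a ValueError for ragged input, while older ones only emit a deprecation warning, so the helper treats both as a refusal.

```python
import warnings

import numpy as np

# Shortened stand-in for the cars data: two splits with 4 and 3 rows.
data = [
    [["high", "med"], ["med", "low"], ["vhigh", "vhigh"], ["high", "med"]],
    [["med", "low"], ["med", "low"], ["low", "low"]],
]

# The sequence is ragged when the splits do not all have the same length.
split_lengths = [len(split) for split in data]
is_ragged = len(set(split_lengths)) > 1


def try_regular_array(nested):
    """Return an ndarray, or None if NumPy rejects the input as ragged.

    Hypothetical helper for illustration only.
    """
    with warnings.catch_warnings():
        # Older NumPy versions only warn on ragged input; escalate that
        # warning to an exception so both cases are handled the same way.
        warnings.simplefilter("error")
        try:
            return np.array(nested)
        except (ValueError, Warning):
            return None


whole = try_regular_array(data)        # ragged, so no regular array
one_split = try_regular_array(data[0]) # a single split is rectangular
```

Each split on its own is rectangular, which is why flattening the splits into one list of rows (as the code in the next section does) gives pandas something it can turn into a regular table.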

Code

  
...

import pandas as pd

# `data` is the ragged nested list shown above.
inner_length = [len(split) for split in data]
keysOuter = []
keysInner = []

# keysOuter holds the outer index label (the split name) for every row.
# keysInner holds the inner index (the row number within its split).
for ind, length in enumerate(inner_length):
    for i in range(length):
        keysOuter.append(f'split{ind + 1}')
        keysInner.append(i)

# Create a flat DataFrame without hierarchy.
rows = [item for split in data for item in split]
df = pd.DataFrame(rows)
df['outer'] = keysOuter
df['inner'] = keysInner
# Set the MultiIndex. The result must be assigned back to df;
# data is still the plain nested list and has no set_index method.
df = df.set_index(['outer', 'inner'])
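
As a possible alternative to building the key lists by hand, the same index can be produced by letting pandas label each split during concatenation. This is a sketch of a different technique (`pd.concat` with `keys` and `names`), not the method used in this post. The `split{n}` labels follow the naming above, and `frames` and `labels` are names introduced here for illustration.

```python
import pandas as pd

data = [
    [
        ['high', 'med', '3', '2', 'med', 'low'],
        ['med', 'low', '5more', '2', 'big', 'med'],
        ['vhigh', 'vhigh', '2', '2', 'med', 'low'],
        ['high', 'med', '4', '2', 'big', 'low'],
    ],
    [
        ['med', 'low', '3', '4', 'med', 'med'],
        ['med', 'low', '4', '4', 'med', 'low'],
        ['low', 'low', '2', '4', 'big', 'med'],
    ],
    [
        ['med', 'vhigh', '4', 'more', 'small', 'high'],
        ['med', 'med', '2', 'more', 'big', 'high'],
        ['med', 'med', '2', 'more', 'med', 'med'],
    ],
]

# One small DataFrame per split; each gets a default 0..n-1 row index,
# which becomes the inner level after concatenation.
frames = [pd.DataFrame(split) for split in data]
labels = [f'split{n + 1}' for n in range(len(frames))]

# keys adds the outer level, names labels both index levels.
df = pd.concat(frames, keys=labels, names=['outer', 'inner'])
```

The trade-off is one temporary DataFrame per split, which is negligible for data of this size.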

DataFrame

outer	inner	0	1	2	3	4	5
split1	0	high	med	3	2	med	low
	1	med	low	5more	2	big	med
	2	vhigh	vhigh	2	2	med	low
	3	high	med	4	2	big	low
split2	0	med	low	3	4	med	med
	1	med	low	4	4	med	low
	2	low	low	2	4	big	med
split3	0	med	vhigh	4	more	small	high
	1	med	med	2	more	big	high
	2	med	med	2	more	med	med

Final Thoughts

If you evaluate df.index, you get a MultiIndex as output:

MultiIndex([('split1', 0),
            ('split1', 1),
            ('split1', 2),
            ('split1', 3),
            ('split2', 0),
            ('split2', 1),
            ('split2', 2),
            ('split3', 0),
            ('split3', 1),
            ('split3', 2)],
           names=['outer', 'inner'])
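
With this index in place, rows can be selected per split or per position. The snippet below is a rough sketch of a few common selections. It rebuilds a reduced two-column version of the DataFrame so that it runs on its own, and the variable names are illustrative only.

```python
import pandas as pd

# Reduced stand-in for the DataFrame above: same index, first two columns only.
index = pd.MultiIndex.from_tuples(
    [('split1', 0), ('split1', 1), ('split1', 2), ('split1', 3),
     ('split2', 0), ('split2', 1), ('split2', 2),
     ('split3', 0), ('split3', 1), ('split3', 2)],
    names=['outer', 'inner'],
)
rows = [
    ['high', 'med'], ['med', 'low'], ['vhigh', 'vhigh'], ['high', 'med'],
    ['med', 'low'], ['med', 'low'], ['low', 'low'],
    ['med', 'vhigh'], ['med', 'med'], ['med', 'med'],
]
df = pd.DataFrame(rows, index=index)

# All rows of one split; the result is indexed by 'inner' only.
split2 = df.loc['split2']

# A single row, addressed by (outer, inner); returned as a Series.
one_row = df.loc[('split1', 3), :]

# The first row of every split; the result is indexed by 'outer'.
first_rows = df.xs(0, level='inner')

# Number of rows in each split.
split_sizes = df.groupby(level='outer').size()
```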


This post is licensed under CC BY 4.0 by the author.