
Convert Ragged Nested Sequences to MultiIndex Pandas DataFrame

What’s a Ragged Nested Sequence?

A ragged nested sequence is a list or tuple whose elements are lists, tuples, or ndarrays with different lengths or shapes. The wording comes from the warning NumPy gives when asked to build an array from such a sequence.

Essentially it’s an array of arrays where the sub-arrays are of different lengths.

[
    [
        ['high', 'med', '3', '2', 'med', 'low'],
        ['med', 'low', '5more', '2', 'big', 'med'],
        ['vhigh', 'vhigh', '2', '2', 'med', 'low'],
        ['high', 'med', '4', '2', 'big', 'low']
    ],
    [
        ['med', 'low', '3', '4', 'med', 'med'],
        ['med', 'low', '4', '4', 'med', 'low'],
        ['low', 'low', '2', '4', 'big', 'med']
    ],
    [
        ['med', 'vhigh', '4', 'more', 'small', 'high'],
        ['med', 'med', '2', 'more', 'big', 'high'],
        ['med', 'med', '2', 'more', 'med', 'med']
    ]
]

This example comes from the cars dataset used in a decision tree tutorial.

Why convert it into a DataFrame?

  • It’s difficult to work with in list form.
  • The sequence cannot be converted to ndarray directly since the lengths are unequal.
  • A deeply nested list is hard to read and inspect at a glance.
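To make the "unequal lengths" point concrete, here is a minimal sketch in plain Python that checks whether a nested list is ragged at the outer level. The `data` list is a shortened, hypothetical stand-in for the one above (only two columns per row), and the names `lengths` and `is_ragged` are illustrative.

```python
# Hypothetical, shortened stand-in for the nested list shown earlier.
data = [
    [['high', 'med'], ['med', 'low'], ['vhigh', 'vhigh'], ['high', 'med']],
    [['med', 'low'], ['med', 'low'], ['low', 'low']],
    [['med', 'vhigh'], ['med', 'med'], ['med', 'med']],
]

# Count the rows in each outer split.
lengths = [len(split) for split in data]

# The sequence is ragged when the splits do not all share one length.
is_ragged = len(set(lengths)) > 1

print(lengths)    # [4, 3, 3]
print(is_ragged)  # True
```

Because the splits hold 4, 3, and 3 rows, there is no single rectangular shape that fits all of them.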

Code

...

import pandas as pd

# Number of rows in each outer split of the ragged list `data`.
inner_length = [len(split) for split in data]
keysOuter = []
keysInner = []

# keysOuter holds the outer index label (the split name) for every row.
# keysInner holds the inner index (the row number within its split).
for ind, length in enumerate(inner_length):
    for i in range(length):
        keysOuter.append(f'split{ind + 1}')
        keysInner.append(i)

# Flatten the nested list into rows and build a flat DataFrame.
rows = [item for split in data for item in split]
df = pd.DataFrame(rows)
df['outer'] = keysOuter
df['inner'] = keysInner
# Set the MultiIndex on df (not on the original list).
df = df.set_index(['outer', 'inner'])
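As a possible alternative to building the key lists by hand, the same two-level index can be sketched with `pd.concat`, which accepts `keys` and `names` arguments. This is a different technique from the loop above, not the post's method. The `data` list is a shortened, hypothetical stand-in, and `frames`, `keys`, and `df_alt` are illustrative names.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the ragged nested list `data`.
data = [
    [['high', 'med'], ['med', 'low'], ['vhigh', 'vhigh'], ['high', 'med']],
    [['med', 'low'], ['med', 'low'], ['low', 'low']],
    [['med', 'vhigh'], ['med', 'med'], ['med', 'med']],
]

# One small DataFrame per split; each gets its own 0-based row index.
frames = [pd.DataFrame(split) for split in data]

# `keys` labels each frame and becomes the outer index level;
# `names` names the two resulting levels.
keys = [f'split{i + 1}' for i in range(len(frames))]
df_alt = pd.concat(frames, keys=keys, names=['outer', 'inner'])

print(df_alt.index.names)
print(df_alt.shape)
```

The inner level here comes from each frame's default row index, so it restarts at 0 for every split, matching the numbering produced by the loop.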

DataFrame


                  0      1      2     3      4     5
outer  inner
split1 0       high    med      3     2    med   low
       1        med    low  5more     2    big   med
       2      vhigh  vhigh      2     2    med   low
       3       high    med      4     2    big   low
split2 0        med    low      3     4    med   med
       1        med    low      4     4    med   low
       2        low    low      2     4    big   med
split3 0        med  vhigh      4  more  small  high
       1        med    med      2  more    big  high
       2        med    med      2  more    med   med

Final Thoughts

Evaluating df.index shows the resulting MultiIndex:

MultiIndex([('split1', 0),
            ('split1', 1),
            ('split1', 2),
            ('split1', 3),
            ('split2', 0),
            ('split2', 1),
            ('split2', 2),
            ('split3', 0),
            ('split3', 1),
            ('split3', 2)],
           names=['outer', 'inner'])
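With a MultiIndex like this in place, rows can be selected by split or by a full (outer, inner) key. The sketch below rebuilds a tiny, hypothetical version of the DataFrame (three rows, two columns) so that it runs on its own. `split1` and `row` are illustrative names.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the DataFrame built above.
index = pd.MultiIndex.from_tuples(
    [('split1', 0), ('split1', 1), ('split2', 0)],
    names=['outer', 'inner'],
)
df = pd.DataFrame(
    [['high', 'med'], ['med', 'low'], ['low', 'low']],
    index=index,
)

# All rows of one outer split; the result is indexed by 'inner' only.
split1 = df.loc['split1']

# A single row, addressed by the full (outer, inner) key.
row = df.loc[('split1', 1)]

print(split1.shape)  # (2, 2)
print(row.tolist())  # ['med', 'low']
```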
This post is licensed under CC BY 4.0 by the author.