Sometimes I Need to Store a Lot of Strings Efficiently, So I Built lite-strtab
This post is about a very specific problem: storing many short, long-lived strings in an immutable (read-only) container, while keeping memory overhead low and lookup performance practical.
Every once in a while I run into the same problem: I have a lot of strings to keep, and I want to save a bunch of memory.
Think of stuff like:
- A catalog or database snapshot you download to see what is available
- Your game's localisation strings
- Static lists of paths, endpoints, or keys
Something you load once, keep around for a while, and query every once in a while.
Motivation
As you may know, I have solo dev'd 200+ projects (very few are forks).
My more recent ones, made in the last 2-3 years, share a common layout/template.
And even a nice Rust manual!
However, keeping all of these repositories up to date has been difficult; you make one template change, and now off you go patching 30+ repos. Oof!
Tech support is a similar story; I'm mostly solo (unfortunately), and questions from users needing help are a daily occurrence, all hours of the day. That means I can't always respond right away 😔; I may be asleep.
I usually aim to respond within 10 minutes.
If I don't reply immediately, I might forget. Even though it's all unpaid work.
So I've been investing in building some LLM-based tooling to help me with these aspects.
Help me migrate repos and catch questions while I'm sleeping 💤.
Why existing tools don't work for me
Resource usage. OpenCode idles at 300MB of private bytes, and 500MB in use.
And that's in server mode WITHOUT the TUI.
Other tools, almost all non-native and built on TS or Python, are similar.
Including those built by nearly-half-a-trillion-dollar companies (e.g. Claude).
I'm going for ~1MB instead.
Where the Strings Come In
I wanted to parse models.dev/api.json, because I need
knobs like env vars and api endpoints.
I also want to keep that in memory for broad model/provider support.
Unfortunately, that includes unsized string data, and with it come some caveats. It's best to illustrate by example.
Unless stated otherwise, assume a 64-bit machine everywhere.
My Data Structure
I start with a structure-of-arrays layout to save memory on struct padding etc.
It looks approximately like (simplified):
/// Runtime lookup catalog with split provider and model tables.
#[derive(Debug, Clone)]
pub struct ModelCatalog {
/// Provider key lookup table.
provider_table: HashTable<PackedProviderTableEntry>,
/// Model key lookup table.
model_table: HashTable<PackedModelTableEntry>,
/// Indexed by index from provider_table (provider_idx)
provider_api_urls: Box<[Box<str>]>,
provider_env_var_pool: Box<[TinyVec<[Box<str>; 1]>]>,
provider_entries: Box<[PackedProviderEntry]>,
/// Indexed by index from model_table (model_config_idx)
model_entries: Box<[PackedModelEntry]>,
model_config_entries: Option<Box<[PackedModelConfigEntry]>>,
}
/// Packed model-table entry.
#[bitfield(u64)] // 8 byte struct! Amended by macro.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub(crate) struct PackedModelTableEntry {
#[bits(48)]
hash48: u64, // 48 bit hash
model_config_idx: u16,
}
/// Packed provider-table entry.
#[bitfield(u64)] // 8 byte struct! Amended by macro.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub(crate) struct PackedProviderTableEntry {
#[bits(48)]
hash48: u64, // 48 bit hash
provider_idx: u16,
}
/// Packed provider metadata row.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub(crate) struct PackedProviderEntry {
provider_type: ProviderType, // 8 bits
}
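For reference, the two *TableEntry bitfields just pack a 48-bit hash and a 16-bit index into a single u64. Here's a rough, hand-rolled sketch of that packing (illustrative only; it assumes fields fill from the least-significant bits, and the bitfield macro generates the real accessors):

// Hand-rolled equivalent of the 48-bit hash + 16-bit index packing.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct PackedEntry(u64);

impl PackedEntry {
    fn new(hash48: u64, idx: u16) -> Self {
        // Low 48 bits hold the hash, high 16 bits hold the index.
        Self((hash48 & 0x0000_FFFF_FFFF_FFFF) | ((idx as u64) << 48))
    }
    fn hash48(self) -> u64 {
        self.0 & 0x0000_FFFF_FFFF_FFFF
    }
    fn idx(self) -> u16 {
        (self.0 >> 48) as u16
    }
}

fn main() {
    let e = PackedEntry::new(0xDEAD_BEEF_CAFE, 42);
    assert_eq!(std::mem::size_of::<PackedEntry>(), 8); // still just one u64
    assert_eq!(e.hash48(), 0xDEAD_BEEF_CAFE);
    assert_eq!(e.idx(), 42);
}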
The Problems
I'll go block-by-block.
1. Box<str> handle tax
provider_api_urls: Box<[Box<str>]>,
On 64-bit, each Box<str> handle is 16 bytes (ptr + len).
In this snapshot, the median env var name length is 16 bytes (mean 16.47), so for example:
- "DEEPSEEK_API_KEY" payload: 16 bytes
- Box<str> handle: 16 bytes
In other words, the handle itself can be as big as the heap allocation.
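A quick way to see the tax, assuming a 64-bit target:

// The handle alone is as large as the payload it points to (64-bit target).
use std::mem::size_of;

fn main() {
    let key: Box<str> = "DEEPSEEK_API_KEY".into();
    assert_eq!(size_of::<Box<str>>(), 16); // fat pointer: ptr + len
    assert_eq!(key.len(), 16);             // heap payload bytes
}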
2. Padding from heap allocations
Each regular 'String' is a separate heap allocation.
And with every allocation there is extra overhead:
- Padding/alignment: a 12-byte payload may occupy a 16-byte size-class bucket.
- Internal metadata: allocator bookkeeping adds bytes beyond payload size.
Exact behaviour varies with allocator & platform.
In short, if you allocate small objects, like a 12-byte string, don't be surprised if a little more memory is actually used due to allocator internals.
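If you want to see this on your own machine, here's a rough Linux/glibc-only sketch (it assumes the libc crate and the default malloc-backed global allocator):

// Compare a small payload's requested size with what the allocator actually reserved.
fn main() {
    let s: Box<str> = "twelve bytes".into(); // 12-byte payload
    // SAFETY: on Linux/glibc the default global allocator is malloc-backed,
    // and `as_ptr` points at the start of that allocation.
    let usable = unsafe { libc::malloc_usable_size(s.as_ptr() as *mut libc::c_void) };
    println!("requested: {} B, usable: {} B", s.len(), usable);
}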
3. Heap locality is not guaranteed
Even within a single array like Box<[Box<str>]>, each Box<str> points to
its own heap allocation, and those allocations are not guaranteed to be adjacent
or in the same region.
# Example 1: unrelated allocation can interleave
api_url("https://...") -> 48B bucket @ 0x...9000
other allocation -> 48B bucket @ 0x...9040 # unrelated value inserted in between
doc_url("https://docs...") -> 48B bucket @ 0x...9080
# Example 2: very similar env keys can split by size class
env[0]("DEEPSEEK_API_KEY") # 16 chars -> 16B-class bucket @ 0x...7000
env[1]("DEEPINFRA_API_KEY") # 17 chars -> next size-class bucket @ 0x...b000
This hurts cache locality and may require additional paging, etc.
This is also why a 'bump allocator' or 'arena' is not the play here.
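You can observe the scattering yourself; a small sketch that just prints where each string's bytes ended up:

// Each Box<str> in the slice owns a separate allocation; addresses need not be adjacent.
fn main() {
    let keys: Box<[Box<str>]> = vec![
        Box::from("DEEPSEEK_API_KEY"),
        Box::from("DEEPINFRA_API_KEY"),
        Box::from("OPENROUTER_API_KEY"),
    ]
    .into_boxed_slice();

    for (i, k) in keys.iter().enumerate() {
        println!("env[{i}] ({} bytes) @ {:p}", k.len(), k.as_ptr());
    }
}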
4. Small arrays are expensive
provider_env_var_pool: Box<[TinyVec<[Box<str>; 1]>]>,
api.json snapshot (96 providers):
- 87 providers have 1 env key
- 5 providers have 2
- 4 providers have 3
So, most providers have a single env entry; but because >1 is possible, we need
to support that case too. However, this isn't so trivial.
- TinyVec<[Box<str>; 1]> uses 32 bytes per array/slice entry, and sometimes spills to the heap (for 2+ keys).
- TinyVec<[Box<str>; 3]> uses 56 bytes per array/slice entry, but usually only 1 entry is used.
- Box<[Box<str>]> uses 16 bytes per array/slice entry, plus an extra 16 bytes on the heap per key for the slice's own Box<str> handles.
Now consider the fact that the median/mean env name length is... 16 bytes.
Way more memory is used for the handles than for the data.
A regular string reference is expensive enough; but with small arrays, doubly so!!
| layout | bytes per 1-key row (requested) | total usable bytes (glibc) |
|---|---|---|
| TinyVec<[Box<str>; 1]> | 32 | 3,568 |
| TinyVec<[Box<str>; 3]> | 56 | 5,384 |
| Box<[Box<str>]> | 32 (16 + 16) | 4,056 |
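You can check the per-row handle sizes yourself with a small sketch (this assumes the tinyvec crate with its alloc feature; exact sizes can vary by compiler and crate version):

// Per-row handle sizes for the layouts above, before any heap payloads.
use std::mem::size_of;
use tinyvec::TinyVec;

fn main() {
    println!("TinyVec<[Box<str>; 1]>: {} B", size_of::<TinyVec<[Box<str>; 1]>>());
    println!("TinyVec<[Box<str>; 3]>: {} B", size_of::<TinyVec<[Box<str>; 3]>>());
    println!("Box<[Box<str>]>:        {} B", size_of::<Box<[Box<str>]>>());
}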
The Core Idea
lite-strtab is kind of like a 'string interner', or 'arena allocator', but with slightly different tradeoffs.
Think of first interning all the strings, and then 'freezing' the table for read-only access. This 'freezing' performs some optimizations.
lite-strtab stores all UTF-8 bytes in one contiguous buffer and keeps one offset table (len + 1 entries, with a final sentinel).
Lookup is straightforward:
- Read offsets[i] and offsets[i + 1]
- Slice bytes[start..end]
- Return &str
No per-string heap allocation, and IDs can be compact (StringId<u16> etc.).
use lite_strtab::StringTableBuilder;
let mut builder = StringTableBuilder::new();
let hello = builder.try_push("hello")?;
let world = builder.try_push("world")?;
let table = builder.build();
assert_eq!(table.get(hello), Some("hello"));
assert_eq!(table.get(world), Some("world"));
# Ok::<(), lite_strtab::Error>(())
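Under the hood, the lookup can be as simple as this sketch: one byte buffer plus an offsets array (illustrative only, not the crate's exact internals):

// A deliberately simplified table: all payloads back to back, plus len + 1 offsets.
struct Table {
    bytes: Box<[u8]>,
    offsets: Box<[u32]>, // offsets.len() == string_count + 1 (final sentinel)
}

impl Table {
    fn get(&self, i: usize) -> Option<&str> {
        let start = *self.offsets.get(i)? as usize;
        let end = *self.offsets.get(i + 1)? as usize;
        // The builder only ever writes valid UTF-8 at these boundaries.
        std::str::from_utf8(&self.bytes[start..end]).ok()
    }
}

fn main() {
    let table = Table {
        bytes: b"helloworld".to_vec().into_boxed_slice(),
        offsets: vec![0, 5, 10].into_boxed_slice(),
    };
    assert_eq!(table.get(0), Some("hello"));
    assert_eq!(table.get(1), Some("world"));
}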
Memory Layout Diagrams
Before: one allocation per string
flowchart TB
subgraph Handles["Container handles"]
H0["entry 0: String or Box<str> handle"]
H1["entry 1: String or Box<str> handle"]
HN["entry N: String or Box<str> handle"]
end
H0 --> A0["heap alloc: bytes for string 0"]
H1 --> A1["heap alloc: bytes for string 1"]
HN --> AN["heap alloc: bytes for string N"]
After: one byte buffer + compact offsets
flowchart TB
IDS["IDs in your structs<br/>StringId<u16> (2B each)"] --> OFF["offsets (u32/u16)<br/>len + 1 entries"]
OFF --> BYTES["one contiguous UTF-8 byte buffer"]
offsets: [0, 5, 10]
bytes: [h e l l o w o r l d]
id 0 -> bytes[0..5] -> "hello"
id 1 -> bytes[5..10] -> "world"
Handle and offset size comparison
| Setting | Reference Size | Limits what? | Practical limit in lite-strtab |
|---|---|---|---|
| String | 24 bytes | N/A | Owned, growable string handle |
| Box<str> | 16 bytes | N/A | Owned, immutable string handle |
| I = u8 | 1 byte | String count | Up to 256 strings (0..=255) |
| I = u16 | 2 bytes | String count | Up to 65,536 strings (0..=65,535) |
| I = u32 | 4 bytes | String count | Up to 4,294,967,296 strings |
| O = u16 | 2 bytes | Total UTF-8 byte size | Up to 65,535 total bytes |
| O = u32 | 4 bytes | Total UTF-8 byte size | Up to about 4 GiB total bytes |
Most common setup: I = u16 + O = u32
This means up to 4 GiB of UTF-8 data and up to 64Ki entries. So every time you store a reference to a string inside a struct, it costs 2 bytes instead of 16/24. Combined with arrays of strings, this gives you massive memory savings.
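To make that concrete, here's a small size comparison, using a plain u16 as a stand-in for a 2-byte StringId<u16>:

use std::mem::size_of;

// One row that stores an owned string handle...
struct WithHandle {
    api_url: Box<str>, // 16-byte fat pointer
    kind: u8,
}

// ...versus one that stores a 2-byte id into a string table.
struct WithId {
    api_url: u16, // stand-in for StringId<u16>
    kind: u8,
}

fn main() {
    assert_eq!(size_of::<WithHandle>(), 24); // 16 + 1, padded to 8-byte alignment
    assert_eq!(size_of::<WithId>(), 4);      // 2 + 1, padded to 2-byte alignment
}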
Improvement with lite-strtab
This is not final code; it's the current plan.
Start with the new ModelCatalog shape:
/// Simplified layout, focusing on string-bearing fields.
pub struct ModelCatalog {
provider_table: HashTable<PackedProviderTableEntry>,
model_table: HashTable<PackedModelTableEntry>,
// provider_idx is reused as the key
provider_api_urls: StringTable, // defaults: O = u32 (offset), I = u16 (StringId)
// env keys are grouped per provider in this table
provider_env_keys: StringTable,
provider_env_ranges: Box<[PackedEnvRange]>,
provider_entries: Box<[PackedProviderEntry]>,
model_entries: Box<[PackedModelEntry]>,
model_config_entries: Option<Box<[PackedModelConfigEntry]>>,
}
#[bitfield(u16)] // each range entry is 16 bits (2 bytes)
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct PackedEnvRange {
#[bits(14)]
start: u16, // first env key index in provider_env_keys
#[bits(2)]
count: u8, // number of env keys (0..=3)
}
This means:
- provider_idx -> API URL is direct (provider_api_urls[provider_idx])
- provider_idx -> env keys is provider_env_ranges[provider_idx] -> (start, count) in provider_env_keys
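A small sketch of that lookup path, using plain fields and Vec<Box<str>> as stand-ins for the bitfield and the StringTables (illustrative only; the values are made up):

// Toy stand-ins just to show the provider_idx -> URL / env-keys path.
struct EnvRange {
    start: u16, // first env key index in provider_env_keys
    count: u8,  // number of env keys (0..=3)
}

struct Catalog {
    provider_api_urls: Vec<Box<str>>, // stand-in for the API-URL StringTable
    provider_env_keys: Vec<Box<str>>, // stand-in for the env-key StringTable
    provider_env_ranges: Vec<EnvRange>,
}

impl Catalog {
    fn api_url(&self, provider_idx: usize) -> &str {
        &self.provider_api_urls[provider_idx]
    }
    fn env_keys(&self, provider_idx: usize) -> &[Box<str>] {
        let range = &self.provider_env_ranges[provider_idx];
        let start = range.start as usize;
        &self.provider_env_keys[start..start + range.count as usize]
    }
}

fn main() {
    let catalog = Catalog {
        provider_api_urls: vec!["https://example.invalid/v1".into()],
        provider_env_keys: vec!["DEEPSEEK_API_KEY".into()],
        provider_env_ranges: vec![EnvRange { start: 0, count: 1 }],
    };
    assert_eq!(catalog.api_url(0), "https://example.invalid/v1");
    assert_eq!(catalog.env_keys(0).len(), 1);
}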
1. API URL indexing Memory Savings
Before:
provider_api_urls: Box<[Box<str>]>,
After:
provider_api_urls: StringTable,
The existing provider_idx (u16) from the hashtable is now the key into the
StringTable, so the effective per-entry metadata is 6 bytes (2-byte index + 4-byte offset).
Formula notes:
96= number of providers in this snapshot.16= oneBox<str>handle on 64-bit (ptr + len).2 + 4= compact metadata model (u16index +u32offset).- Ratio column is
after / before.
On this snapshot (96 providers):
| Step | Formula | Bytes | Ratio |
|---|---|---|---|
| Before (Box<str> handles) | 96 * 16 | 1,536 B | 1.00x |
| After (StringTable metadata) | 96 * (2 + 4) | 576 B | 0.38x |
| Saved | 1,536 - 576 | 960 B | - |
This table only counts the indexing/reference layer, not UTF-8 payload bytes.
2. Environment Variable Memory Savings
Before:
provider_env_var_pool: Box<[TinyVec<[Box<str>; 1]>]>,
After:
provider_env_ranges: Box<[PackedEnvRange]>,
provider_env_keys: StringTable,
The existing provider_idx (u16) from the hashtable is now the index into provider_env_ranges, which gives us a (start, count) range of keys in the provider_env_keys StringTable.
Formula notes:
- 32 = one TinyVec<[Box<str>; 1]> row.
- 16 = one Box<str> handle on 64-bit.
- 87 / 5 / 4 are the provider counts with 1 / 2 / 3 env keys.
- X is allocator overhead from spill allocations (size-class rounding + metadata).
- Ratio values in the comparison are after / before.
Given 96 providers, 109 total env keys:
Before (TinyVec) path breakdown:
| Component | Formula | Bytes |
|---|---|---|
| Per-provider rows | 96 * 32 | 3,072 B |
| Spilled handles (2-key providers) | 5 * (2 * 16) | 160 B |
| Spilled handles (3-key providers) | 4 * (3 * 16) | 192 B |
| Total requested | 3,072 + 160 + 192 | 3,424 B |
| Allocator overhead from spills | +X | +144 B |
| Total usable on glibc | 3,424 + 144 | 3,568 B |
After (PackedEnvRange + StringTable) breakdown:
| Component | Formula | Bytes |
|---|---|---|
| Provider ranges | 96 * 2 | 192 B |
| Per-env-key index+len component | 109 * 2 | 218 B |
| Per-env-key offset component | 109 * 4 | 436 B |
| Total requested (after) | 192 + 218 + 436 | 846 B |
Comparison:
| Layout | Bytes | Saved vs after | Ratio |
|---|---|---|---|
| Before (TinyVec, requested) | 3,424 B | 2,578 B | 0.25x |
| Before (Box<[Box<str>]>) | 3,280 B | 2,434 B | 0.26x |
| After (PackedEnvRange + StringTable) | 846 B | - | 1.00x |
3. Heap Savings
From benchmark results (malloc_usable_size, Linux/glibc):
| Dataset | lite-strtab | Box<[Box<str>]> | Vec<String> | Saving | Ratio |
|---|---|---|---|---|---|
| EnvKeys | 2,240 B | 2,728 B | 2,888 B | 488 B | 0.82x |
| ApiUrls | 4,352 B | 4,672 B | 4,736 B | 320 B | 0.93x |
| TOTAL | 6,592 B | 7,400 B | 7,624 B | 808 B | 0.89x |
4. Combined impact
| Layer (derived from sections) | Before | After | Saving | Ratio |
|---|---|---|---|---|
| 1. API URL indexing | 1,536 B | 576 B | 960 B | 0.38x |
| 2. Env indexing (best baseline) | 3,424 B | 846 B | 2,578 B | 0.25x |
| 3. Heap (TOTAL) | 7,400 B | 6,592 B | 808 B | 0.89x |
| TOTAL (1 + 2 + 3) | 12,360 B | 8,014 B | 4,346 B | 0.65x |
You might think: damn, this person did all this work just to save 4K...
Yes.
We'll save in other places in the future too 😉, but I might as well write the library now instead of going back to old code later and rewriting it to use the library.
Extra Benchmarks
Full benchmark tables and dataset details (full README)
This is a short summary of the crate info.
What about a larger dataset?
- YakuzaKiwami (4,650 relative game paths, 238,109 bytes).
String reference cost in structs
You can store references to strings very very cheaply!! 💜
| Representation | Bytes | Ratio |
|---|---|---|
| lite-strtab | 9,300 B | 1.00x |
| Box<[Box<str>]> | 74,400 B | 8.00x |
| Vec<String> | 111,600 B | 12.00x |
For 4,650 strings...
Box<[Box<str>]>:
- Struct field: 16 B × 4,650 = 74 KB
lite-strtab:
- Struct field: 2 B × 4,650 = 9 KB
String references in your structs cost 2 bytes instead of 16-24.
Heap allocation size
Because a single buffer needs far less per-allocation alignment and rounding, we also save some space on the heap:
| Representation | Bytes | Ratio |
|---|---|---|
| lite-strtab | 256,736 B | 1.00x |
| Box<[Box<str>]> | 272,528 B | 1.06x |
| Vec<String> | 272,640 B | 1.06x |
A neat side effect of more manual alignment control.
Heap memory numbers were measured with glibc malloc_usable_size on Linux.
This accounts for allocator-induced alignment overhead; we're keeping it real ❤️🔥
Total memory usage (struct refs + heap)
Based on the two tables above:
| Representation | Struct refs | Heap alloc | Total bytes | Ratio |
|---|---|---|---|---|
| lite-strtab | 9,300 B | 256,736 B | 266,036 B | 1.00x |
| Box<[Box<str>]> | 74,400 B | 272,528 B | 346,928 B | 1.30x |
| Vec<String> | 111,600 B | 272,640 B | 384,240 B | 1.44x |
Read performance (YakuzaKiwami, sequential pass + AHash)
| Access | Representation | avg time (us) | avg thrpt (GiB/s) |
|---|---|---|---|
| get | Vec<String> | 13.561 | 16.352 |
| get | Box<[Box<str>]> | 13.002 | 17.056 |
| get | lite-strtab | 13.368 | 16.589 |
| get | lite-strtab (null-padded) | 13.714 | 16.171 |
| get_unchecked | Vec<String> | 13.448 | 16.490 |
| get_unchecked | Box<[Box<str>]> | 12.812 | 17.308 |
| get_unchecked | lite-strtab | 13.207 | 16.790 |
| get_unchecked | lite-strtab (null-padded) | 13.828 | 16.037 |
In practice, this is near parity for a realistic read-and-process workload.
Assembly-level lookup cost (x86_64, release)
As an extra note: even though we don't store ptr+len handles, deriving them is not that expensive 😉.
Here are the x86_64 instruction counts for getting a &str.
| Method | Instructions | Access pattern |
|---|---|---|
| lite-strtab::get | ~12 | bounds check -> load 2 offsets -> compute range -> add base |
| lite-strtab::get_unchecked | ~7 | load 2 offsets -> compute range -> add base |
| Vec<String>::get | ~8 | bounds check -> load ptr from heap -> deref for (ptr, len) |
| Vec<String>::get_unchecked | ~5 | load ptr from heap -> deref for (ptr, len) |
| Box<[Box<str>]>::get | ~7 | bounds check -> load ptr -> deref for (ptr, len) |
| Box<[Box<str>]>::get_unchecked | ~4 | load ptr -> deref for (ptr, len) |
Extra benchmark notes
- Current published benchmarks are sequential access patterns.
- Random access patterns may behave differently and need dedicated tests.
Closing
If your workload has lots of long-lived immutable strings, this data structure shape is worth considering.
It is simple, predictable, and does one job well.
The sources and docs are here: