data.lp

This file contains the data processing constants and functions used by the tangling and weaving processes.

*:

(License)

(Imports)

(Processing limits)

(Formatting keywords)

(Configuration keywords)

(Data structure types)

(Error set)

(Line splitting function)

(Configuration searching function)

(Section searching function)

(Command type detection function)

(Parsing functions)

(Code generation functions)

(Text generation function)

License:

// Copyright 2022 DistressNetwork° <uplink@distress.network>
// This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/.

Constants

We first import the standard library and the logging function from log.zig, and alias the allocator type for brevity.

Imports:

const std = @import("std");
const log = @import("log.zig").log;

const Allocator = std.mem.Allocator;

We define the maximum input file size as 4 GiB, and the code generation function’s maximum recursion depth as 250 nested calls.

Processing limits:

pub const input_max = 0x1_0000_0000;
pub const dereference_max = 250;
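
As a quick sanity check on these values, the following sketch verifies the arithmetic at compile time. The constants are repeated so that the snippet stands alone, and gib is a helper name introduced only for this illustration. The second assertion rests on an assumption drawn from the code generation function later in this file, whose depth counter is a u8: the limit has to stay below 255 for the incremented depth to remain representable.

```zig
const std = @import("std");

// Values copied from the processing limits above.
const input_max = 0x1_0000_0000;
const dereference_max = 250;

// Hypothetical helper constant: the number of bytes in one GiB.
const gib = 1024 * 1024 * 1024;

comptime {
    // 0x1_0000_0000 = 2^32 bytes = 4 GiB.
    std.debug.assert(input_max == 4 * gib);
    // The depth counter is a u8, so the limit must leave room for one increment.
    std.debug.assert(dereference_max < std.math.maxInt(u8));
}
```

If either constant were changed to a value violating these assumptions, the snippet would fail to compile rather than fail at run time.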

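We then define the recognized formatting keywords. These consist of the commands which start a section, append to a section, end a section, and reference another section, along with the name of the root section: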

Formatting keywords:

pub const k_start   = "@: ";
pub const k_add     = "@+ ";
pub const k_end     = "@.";
pub const k_ref     = "@= ";
pub const k_root    = "*";

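We similarly define the recognized configuration keywords, consisting of the declarations of the format strings for the four formatting commands, the escape sequence for the section name, and the escape sequence for a newline: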

Configuration keywords:

pub const kc_start  = "@start ";
pub const kc_add    = "@add ";
pub const kc_end    = "@end ";
pub const kc_ref    = "@ref ";
pub const kc_esc    = "@@";
pub const kc_nl     = "\\n";

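We then define the data structures used for parsing the input into code sections. A Section consists of a name and a list of Content elements. Each Content element is either a literal range of lines, given as inclusive start and end indices into the list of lines, or a reference to another section by name.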

Data structure types:

pub const Section = struct {
    name: []const u8,
    content: []const Content,
};

pub const Content = union(enum) {
    literal: LineRange,
    reference: []const u8,
};

pub const LineRange = struct {
    start: u32,
    end: u32,
};
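
To make the shape of these types concrete, the following sketch builds the section list by hand for a small input. The input lines, the section name Main, and the names root_content, main_content and sections are all made up for this illustration; the types are repeated so that the snippet stands alone. The values are what we would expect the parsing functions in this file to produce for that input, not captured output.

```zig
const std = @import("std");

// Types repeated from above so that the snippet stands alone.
const LineRange = struct { start: u32, end: u32 };
const Content = union(enum) { literal: LineRange, reference: []const u8 };
const Section = struct { name: []const u8, content: []const Content };

// Hypothetical input, one entry per line (indices 0 to 7).
const lines = [_][]const u8{
    "Some prose.",
    "@: *",
    "const std = @import(\"std\");",
    "@= Main",
    "@.",
    "@: Main",
    "pub fn main() void {}",
    "@.",
};

// Line 2 is literal code; line 3 refers to the section named "Main".
const root_content = [_]Content{
    .{ .literal = .{ .start = 2, .end = 2 } },
    .{ .reference = "Main" },
};

// Line 6 is the only line of the second section.
const main_content = [_]Content{
    .{ .literal = .{ .start = 6, .end = 6 } },
};

const sections = [_]Section{
    .{ .name = "*", .content = &root_content },
    .{ .name = "Main", .content = &main_content },
};
```

Note that a literal element stores only indices, so the text of the lines is never copied into the section list.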

We also define the set of errors which may be encountered by the various processing functions, consisting of:

Error set:

pub const Errors = error {
    UnexpectedStart,
    UnexpectedEnd,
    DereferenceLimit,
    NotFound,
};

Preprocessing & Searching

We define the line splitting function, which splits the input file into lines and collects them in a buffer, as follows.

Line splitting function:

pub fn split_lines(file: []const u8, alloc: Allocator) ![][]const u8 {
    var buffer = std.ArrayList([]const u8).init(alloc);
    defer buffer.deinit();

    (Split file at each newline)

    return buffer.toOwnedSlice();
}

The function iteratively splits the file at each newline and appends each resulting line to the buffer.

Split file at each newline:

var iterator = std.mem.split(u8, file, "\n");
while (iterator.next()) |line| {
    try buffer.append(line);
}

In addition, the final empty line created by a trailing newline at the end of the file (inserted automatically by some text editors) is removed, if it exists. Only a single trailing empty line is removed. The check is performed only if the buffer is non-empty, to avoid out-of-bounds indexing.

+ Split file at each newline:

if ((buffer.items.len > 0) and std.mem.eql(u8, buffer.items[buffer.items.len - 1], "")) {
    _ = buffer.pop();
}
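
The effect of the trailing newline rule can be checked with a small stand-alone sketch. Rather than calling the split iterator, it computes how many lines we would expect split_lines to return: one more than the number of newlines, minus one if the last segment is empty. The function name expectedLineCount is hypothetical and is not part of this file.

```zig
const std = @import("std");

// Hypothetical helper: the number of lines expected from the splitting
// function above for a given file.
fn expectedLineCount(file: []const u8) usize {
    // Splitting at each newline yields one more segment than there are newlines.
    const segments = std.mem.count(u8, file, "\n") + 1;
    // The last segment is empty if the file is empty or ends with a newline.
    const last_is_empty = file.len == 0 or std.mem.endsWith(u8, file, "\n");
    // At most one trailing empty segment is dropped.
    return if (last_is_empty) segments - 1 else segments;
}
```

Under this reading, a file with and without a final newline yields the same number of lines, while a file ending in two newlines keeps one empty line at the end.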

We define the configuration searching function, which looks for the first line beginning with the given configuration keyword and returns a list containing the segments of its split format string. The function returns from within the for loop as soon as the declaration is found; otherwise an error is logged and returned.

Configuration searching function:

pub fn get_conf(lines: [][]const u8, key: []const u8, alloc: Allocator) ![][]const u8 {
    for (lines) |line| {
        if (std.mem.startsWith(u8, line, key)) {
            return try fmt_conf(line, key, alloc);
        }
    }
    log(.err, "config declaration '{s}' not found", .{std.mem.trimRight(u8, key, " \t")});
    return error.NotFound;
}

(Auxiliary formatting function)

If the declaration is found, its contained format string is split along instances of the section name escape sequence, and each substring has its instances of the newline escape sequence replaced with a literal newline.

Auxiliary formatting function:

fn fmt_conf(line: []const u8, key: []const u8, alloc: Allocator) ![][]const u8 {
    var buffer = std.ArrayList([]const u8).init(alloc);
    defer buffer.deinit();

    var iterator = std.mem.split(u8, line[(key.len)..], kc_esc);
    while (iterator.next()) |str| {
        try buffer.append(try std.mem.replaceOwned(u8, alloc, str, kc_nl, "\n"));
    }

    return buffer.toOwnedSlice();
}
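
To illustrate what the two escape sequences amount to once a section name is substituted, here is a minimal sketch. It is not the method used above: instead of splitting the format string into segments, it replaces the newline escape first and then replaces each section name escape with the name, which should yield the same final text as joining the segments with the name. The function name expandFormat and the sample format string are made up for this illustration.

```zig
const std = @import("std");

// Hypothetical helper: expands a format string for one section name.
// The caller owns the returned memory.
fn expandFormat(alloc: std.mem.Allocator, format: []const u8, name: []const u8) ![]u8 {
    // Replace the two-character escape (backslash, 'n') with a real newline.
    const unescaped = try std.mem.replaceOwned(u8, alloc, format, "\\n", "\n");
    defer alloc.free(unescaped);
    // Replace every occurrence of the section name escape with the name.
    return std.mem.replaceOwned(u8, alloc, unescaped, "@@", name);
}
```

For example, expanding the format string `<h3>@@</h3>\n` (with a literal backslash and n) for the name Imports gives `<h3>Imports</h3>` followed by a newline.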

We define the code section searching function, which returns the index (into the section list) of the first section with a matching name, or returns an error if none exists.

Section searching function:

fn search(list: []Section, name: []const u8) !usize {
    for (list) |section, index| {
        if (std.mem.eql(u8, section.name, name)) return index;
    }
    log(.err, "section '{s}' not found", .{name});
    return error.NotFound;
}

Parsing

We first define a function which, for a given line, determines whether it consists of a formatting command, and which type of command it contains. This is done in order to enable the use of switch statements in later functions using this routine.

Command type detection function:

const CommandType = enum { start, add, end, ref, none };

fn command_type(line: []const u8) CommandType {
    if (std.mem.startsWith(u8, line, k_start)) {
        return .start;
    } else if (std.mem.startsWith(u8, line, k_add)) {
        return .add;
    } else if (std.mem.eql(u8, line, k_end)) {
        return .end;
    } else if (std.mem.startsWith(u8, std.mem.trimLeft(u8, line, " \t"), k_ref)) {
        return .ref;
    } else {
        return .none;
    }
}
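
The order and strictness of these checks have a few consequences that are easy to miss, so the following sketch restates them in a stand-alone form. The names Kind and classify are hypothetical, the keywords are written inline, and the leading whitespace is skipped with a manual loop rather than with std.mem.trimLeft; the intent is to mirror the behavior described above, not to replace the function.

```zig
const std = @import("std");

const Kind = enum { start, add, end, ref, none };

// Hypothetical restatement of the detection rules above.
fn classify(line: []const u8) Kind {
    // Start and append commands must begin at the first column.
    if (std.mem.startsWith(u8, line, "@: ")) return .start;
    if (std.mem.startsWith(u8, line, "@+ ")) return .add;
    // The end command must match the whole line exactly.
    if (std.mem.eql(u8, line, "@.")) return .end;
    // Reference commands may be indented with spaces or tabs.
    var i: usize = 0;
    while (i < line.len and (line[i] == ' ' or line[i] == '\t')) : (i += 1) {}
    if (std.mem.startsWith(u8, line[i..], "@= ")) return .ref;
    return .none;
}
```

In particular, an indented start command and an end command followed by a trailing space are both treated as ordinary lines.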

We then define the parsing functions, consisting of the main parse function, which builds the list of Sections, and its auxiliary parse_code subroutine, which builds the contents of each Section.

Parsing functions:

pub fn parse(lines: [][]const u8, alloc: Allocator) ![]Section {
    var sections = std.ArrayList(Section).init(alloc);
    defer sections.deinit();

    (Main parsing routine)

    return sections.toOwnedSlice();
}

fn parse_code(lines: [][]const u8, index: u32, alloc: Allocator) !CodeReturn {
    var content = std.ArrayList(Content).init(alloc);
    defer content.deinit();

    (Code parsing subroutine)

    return CodeReturn{ .content = content.toOwnedSlice(), .index = i + 1 };
}

The latter function takes the list of lines and the allocator, like the main function, but is also passed the index of the line at which the code section begins. It must report the index of the line at which the main function should resume parsing once the code section has been parsed. It thus returns a struct consisting of the contents of the code section and the index of the next line, as follows.

+ Parsing functions:

const CodeReturn = struct {
    content: []const Content,
    index: u32,
};

The main parsing routine iterates over the list of lines, adding code sections where they occur, and otherwise ignoring text sections. If a section end command is encountered in the absence of a preceding starting command, an error is returned.

Main parsing routine:

var i: u32 = 0;
while (i < lines.len) {
    const line = lines[i];
    switch (command_type(line)) {
        .start  => {
            (Add new section)
        },
        .add    => {
            (Append to section)
        },
        .end    => {
            log(.err, "line {d}: unexpected section end", .{i + 1});
            return error.UnexpectedEnd;
        },
        else    => {
            i += 1;
        },
    }
}

To add a new section, the name (consisting of everything after the starting token) is first retrieved from the starting command. Then the code parsing subroutine is called, beginning at the line after the starting command, and it returns the resulting code section (section.content) and the next line at which to resume parsing (section.index). The code section is appended to the section list, and the parsing routine continues at the provided index.

Add new section:

const name = line[(k_start.len)..];
log(.debug, "({d}) starting section '{s}'", .{ i + 1, name });

const section = try parse_code(lines, i + 1, alloc);
try sections.append(.{ .name = name, .content = section.content });

log(.debug, "({d}) ending section '{s}'", .{ section.index, name });
i = section.index;

To append to an existing section, the section name and the code section contents to be appended are retrieved as above. The index of the section is located, along with its address within the section list. Next, the new contents of the section are created by concatenating the old contents with the newly parsed code section contents. The section list is then updated to point to the new section contents, and the parsing routine continues.

Append to section:

const name = line[(k_add.len)..];
log(.debug, "({d}) appending to section '{s}'", .{ i + 1, name });

const section = try parse_code(lines, i + 1, alloc);
const index = try search(sections.items, name);
const old = &sections.items[index];
const new = try std.mem.concat(alloc, Content, &[_][]const Content{ old.*.content, section.content });
old.*.content = new;

log(.debug, "({d}) ending section '{s}'", .{ section.index, name });
i = section.index;

The code parsing subroutine iterates over the list of lines similarly to the main routine. If a starting or appending command is encountered before the current section’s ending command, an error is raised. Reference commands may be preceded by any amount of whitespace. The loop exits upon encountering an ending command. Any other line is appended as a literal element.

Code parsing subroutine:

var i = index;
while (i < lines.len) {
    const line = lines[i];
    switch (command_type(line)) {
        .start, .add    => {
            log(.err, "line {d}: unexpected section start", .{i + 1});
            return error.UnexpectedStart;
        },
        .ref    => {
            (Add reference)
        },
        .end    => {
            break;
        },
        else    => {
            (Add literal range)
        },
    }
}

To add a reference, the name of the referenced section is retrieved, consisting of the characters following the leading whitespace and the command token. The resulting string is appended to the section contents list, and the parser continues at the next line.

Add reference:

const ref_name = std.mem.trimLeft(u8, line, " \t")[(k_ref.len)..];
try content.append(.{ .reference = ref_name });
log(.debug, "({d}) \tappended reference '{s}'", .{ i + 1, ref_name });
i += 1;

To add a literal range, the parser either updates the end index of the previous literal element, or creates a new literal element if the last element added is a reference. This action of switching on the previous section element must only occur if the section contents list is non-empty, in order to prevent out-of-bounds indexing. Otherwise, the parser unconditionally appends a new literal element to the list. After either case, parsing continues at the next line.

Add literal range:

if (content.items.len > 0) {
    switch (content.items[content.items.len - 1]) {
        .literal => |*range| {
            range.*.end = i;
        },
        .reference => {
            try content.append(.{ .literal = .{ .start = i, .end = i } });
            log(.debug, "({d}) \tappending literal", .{i + 1});
        },
    }
} else {
    try content.append(.{ .literal = .{ .start = i, .end = i } });
    log(.debug, "({d}) \tappending literal", .{i + 1});
}
i += 1;
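
The merging of consecutive literal lines can be seen in isolation in the following sketch. It is a simplification under stated assumptions: a fixed-size buffer and an explicit length stand in for the growable list, and the names Range, Item and addLiteral are hypothetical.

```zig
const std = @import("std");

const Range = struct { start: u32, end: u32 };
const Item = union(enum) { literal: Range, reference: []const u8 };

// Hypothetical helper: records line `i` as literal content and returns the
// new number of items. Assumes `items` has room for one more element.
fn addLiteral(items: []Item, len: usize, i: u32) usize {
    if (len > 0) {
        switch (items[len - 1]) {
            // The previous element is a literal range: extend it in place.
            .literal => |*range| {
                range.end = i;
                return len;
            },
            // The previous element is a reference: fall through and start a new range.
            .reference => {},
        }
    }
    items[len] = .{ .literal = .{ .start = i, .end = i } };
    return len + 1;
}
```

Adding lines 3 and 4 in succession therefore leaves a single element covering both lines, whereas a line added after a reference begins a new element.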

Code Generation

We define the source code generation procedure, which is split into two functions. The wrapper function begins code generation at (the index of) the top-level section. The main procedure iterates over the contents of the current section, recursively resolving section references and appending literal elements to the list of source code lines.

Code generation functions:

pub fn codegen(lines: [][]const u8, list: []Section, alloc: Allocator) ![][]const u8 {
    const root = try search(list, k_root);
    return try codegen_main(lines, list, root, 0, alloc);
}

fn codegen_main(lines: [][]const u8, list: []Section, index: usize, depth: u8, alloc: Allocator) anyerror![][]const u8 {
    var buffer = std.ArrayList([]const u8).init(alloc);
    defer buffer.deinit();

    const section = list[index];
    log(.debug, "generating section '{s}'", .{section.name});
    for (section.content) |content| switch (content) {
        .literal => |range| {
            (Append literal range)
        },
        .reference => |name| {
            (Resolve reference)
        },
    };

    log(.debug, "ending section '{s}'", .{section.name});
    return buffer.toOwnedSlice();
}

To append a literal range, the range of lines is simply appended to the buffer.

Append literal range:

log(.debug, "adding literal range {d}-{d}", .{ range.start + 1, range.end + 1 });
try buffer.appendSlice(lines[(range.start)..(range.end + 1)]);

To resolve a section reference, the function first checks whether the current recursion depth has exceeded the configured limit, and returns an error if so. Otherwise, the index of the referenced section is retrieved, its contents are recursively generated (with an incremented recursion depth), and the resulting source code lines are appended to the buffer.

Resolve reference:

if (depth > dereference_max) {
    log(.err, "section dereferencing recursion depth exceeded (max {d})", .{dereference_max});
    return error.DereferenceLimit;
}
const ref = try search(list, name);
const code = try codegen_main(lines, list, ref, depth + 1, alloc);
try buffer.appendSlice(code);
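
The purpose of the depth check is to stop sections which (directly or indirectly) reference themselves. The following sketch shows the same guard on a reduced model which only counts output lines instead of collecting them. The names Item, countLines and max_depth are hypothetical, and references are stored as section indices here rather than names, so that no searching is needed.

```zig
const std = @import("std");

const max_depth = 250;

// Reduced model: a literal element only records how many lines it spans,
// and a reference holds the index of the referenced section.
const Item = union(enum) { lines: u32, reference: usize };

// Hypothetical helper: counts the lines generated for the section at `index`.
// The explicit error type is needed because the function is recursive.
fn countLines(sections: []const []const Item, index: usize, depth: u8) anyerror!u32 {
    var total: u32 = 0;
    for (sections[index]) |item| {
        switch (item) {
            .lines => |n| {
                total += n;
            },
            .reference => |target| {
                // Same guard as above: refuse to descend past the limit.
                if (depth > max_depth) return error.DereferenceLimit;
                total += try countLines(sections, target, depth + 1);
            },
        }
    }
    return total;
}
```

With this model, a section of two lines, a reference to a three-line section, and one further line counts six lines, while a section referencing itself fails with the limit error instead of recursing indefinitely.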

Text Generation

Finally, we define the text generation function which iterates over the list of lines and produces the literate program text to be passed to an external document processor. In order to keep track of the name of the code section currently being formatted at any given point, the variable current_name is continually updated to contain the current name string. Configuration declarations are skipped, and lines which do not contain any formatting commands are appended as they are.

Text generation function:

pub fn textgen(lines: [][]const u8, alloc: Allocator) ![][]const u8 {
    var buffer = std.ArrayList([]const u8).init(alloc);
    defer buffer.deinit();

    (Process configuration declarations)

    var current_name: []const u8 = undefined;
    for (lines) |line| {
        if (std.mem.startsWith(u8, line, kc_start) or
            std.mem.startsWith(u8, line, kc_add) or
            std.mem.startsWith(u8, line, kc_end) or
            std.mem.startsWith(u8, line, kc_ref)) {
            continue;
        } else switch (command_type(line)) {
            .start  => {
                (Format starting command)
            },
            .add    => {
                (Format appending command)
            },
            .ref    => {
                (Format reference command)
            },
            .end    => {
                (Format ending command)
            },
            else    => {
                try buffer.append(line);
            },
        }
    }
    
    return buffer.toOwnedSlice();
}

The formatting strings given by each configuration declaration are first retrieved. If the declaration of the format string for the section appending command is omitted, the format string for the section starting command is used in its place.

Process configuration declarations:

const conf_start = try get_conf(lines, kc_start, alloc);
const conf_add = get_conf(lines, kc_add, alloc) catch conf_start;
const conf_end = try get_conf(lines, kc_end, alloc);
const conf_ref = try get_conf(lines, kc_ref, alloc);

To process a section starting command, the current section name is updated, and the contents of the corresponding formatting command (that is, the segments of the split formatting string) are interspersed with copies of the current section name. The resulting string is then appended to the buffer.

Format starting command:

current_name = line[(k_start.len)..];
try buffer.append(try std.mem.join(alloc, current_name, conf_start));
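
Using the section name as the separator of a join is what places a copy of the name at each position where the escape sequence appeared in the format string. A minimal sketch, in which the function name formatCommand and the sample segments are made up for illustration:

```zig
const std = @import("std");

// Hypothetical helper: intersperses the format segments with the section name.
// The caller owns the returned memory.
fn formatCommand(alloc: std.mem.Allocator, segments: []const []const u8, name: []const u8) ![]u8 {
    // The name acts as the separator between consecutive segments.
    return std.mem.join(alloc, name, segments);
}
```

For example, the three segments `<h3 id="`, `">` and `</h3>` joined with the name Imports give `<h3 id="Imports">Imports</h3>`.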

Processing a section appending command is performed similarly.

Format appending command:

current_name = line[(k_add.len)..];
try buffer.append(try std.mem.join(alloc, current_name, conf_add));

To process a reference command, the index of the reference command keyword is first extracted. Then the formatted reference string is created, to which the reference command line’s leading whitespace is prepended (to preserve indentation).

Format reference command:

const start = std.mem.indexOf(u8, line, k_ref).?;
const ref = try std.mem.join(alloc, line[(start + k_ref.len)..], conf_ref);
try buffer.append(try std.mem.concat(alloc, u8, &[_][]const u8{ line[0..start], ref }));
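
The handling of indentation can be sketched on its own as follows. The function name formatReference, the error name NotAReference and the sample segments are hypothetical. Unlike the code above, which may assume that the keyword is present, the sketch returns an error when it is missing.

```zig
const std = @import("std");

const k_ref = "@= ";

// Hypothetical helper: formats one reference line, keeping its indentation.
// The caller owns the returned memory.
fn formatReference(alloc: std.mem.Allocator, line: []const u8, segments: []const []const u8) ![]u8 {
    // Everything before the keyword is the indentation to preserve.
    const start = std.mem.indexOf(u8, line, k_ref) orelse return error.NotAReference;
    // The referenced section name follows the keyword.
    const body = try std.mem.join(alloc, line[(start + k_ref.len)..], segments);
    defer alloc.free(body);
    return std.mem.concat(alloc, u8, &[_][]const u8{ line[0..start], body });
}
```

With the segments `(` and `)`, a reference to Imports indented by four spaces becomes `(Imports)` preceded by the same four spaces.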

Processing a section ending command is performed similarly to the starting and appending commands, except that it does not require updating the current section name.

Format ending command:

try buffer.append(try std.mem.join(alloc, current_name, conf_end));