String parsing in Swift

Almost every program on the planet has to deal with strings one way or another, since text is so fundamental to how we both communicate and represent various forms of data. But handling and parsing strings in a way that’s both robust and efficient can, at times, be really difficult. While some strings come in a very strict and computer-friendly format, such as JSON or XML, other strings can be much more chaotic.

This week, let’s take a look at various ways to parse and extract information from such strings, and how different techniques and APIs will yield a different set of trade-offs.

String and its friends

In some ways, Swift has developed a reputation for being a bit difficult to work with when it comes to string parsing. While it’s true that Swift’s String implementation doesn’t offer the same amount of convenience as many other languages do (for example, you can’t simply randomly access a given character using an integer, like string[7]), it does make it easier to write correct string parsing code.

Because while it’s nice to be able to randomly access any given character in a string based on its perceived position, the multi-lingual (or perhaps emoji-lingual?) world that we live in today makes such APIs very error prone, since the way a character is represented in a text UI is in many cases very different from how it’s actually stored within a string value.

In Swift, a string is composed of a collection of Character values, stored using the UTF-8 encoding. That means that if we iterate over a string (for example using a for loop), each element will be a Character — which might be a letter, an emoji, or some other form of character. To identify groups of characters (such as letters or numbers), we can use a CharacterSet, which can be passed to several different APIs on String and its related types.

Tokenizing usernames

Let’s say that we’re working on an app that lets several different users collaborate on a document, and that we want to implement a feature that lets users mention other people using a Twitter-like @mention syntax.

Identifying what users that were mentioned within a given string is a task that’s actually quite similar to what the Swift compiler needs to do when identifying the various parts within a string of code — a process known as lexing or tokenizing — only that our implementation will be orders of magnitude simpler, since we only need to look for one kind of token.

An initial implementation might look something like this — a computed property on String, in which we split the string up based on @ characters, drop the first element (since it’ll be the text before the first @-sign), and then compactMap over the result — identifying strings of non-empty letter-only characters:

extension String {
    var mentionedUsernames: [String] {
        let parts = split(separator: "@").dropFirst()

        // Character sets may be inverted to identify all
        // characters that are *not* a member of the set.
        let delimiterSet = CharacterSet.letters.inverted

        return parts.compactMap { part in
            // Here we grab the first sequence of letters right
            // after the @-sign, and check that it’s non-empty.
            let name = part.components(separatedBy: delimiterSet)[0]
            return name.isEmpty ? nil : name
        }
    }
}

The above implementation is quite simple, and makes use of some really nice Swift features — such as modifying collections, using compactMap to discard nil values, and so on. But it does have one problem — it requires three iterations, one to split the string based on @ characters, one to iterate over all those parts, and then one to split each part based on non-letter characters.

While each iteration is smaller than the one before it (so our algorithm’s complexity is not quite O(3N)), multiple iterations will most often result in some form of bottleneck as the input dataset grows. That could become a problem in our case, since we’re planning on applying this algorithm to documents of any size (maybe some users will work on a book together using our app, who knows?), so let’s see if we can do something to optimize it.

Rather than splitting our string up into components, and then iterating over those components, let’s instead make a single pass through our string — by iterating over its characters. While this’ll require a bit more manual parsing code, it’ll enable us to reduce our algorithm into one single iteration — like this:

extension String {
    var mentionedUsernames: [String] {
        // Setting up our state, which is any partial name that we’re
        // currently parsing, and an array of all names found.
        var partialName: String?
        var names = [String]()

        // A nested parsing function, that we’ll apply to each
        // character within the string.
        func parse(_ character: Character) {
            if var name = partialName {
                guard character.isLetter else {
                    // If we encounter a non-letter character
                    // while parsing a name, it means that the
                    // name is finished, and we can add it to
                    // our array (if non-empty):
                    if !name.isEmpty {
                        names.append(name)
                    }

                    // Reset our state, and parse the character
                    // again, since it might be an @-sign.
                    partialName = nil
                    return parse(character)
                }

                name.append(character)
                partialName = name
            } else if character == "@" {
                // Set an empty state, to signal to our above
                // code that it’s time to start parsing a name.
                partialName = ""
            }
        }

        // Apply our parsing function to each character
        forEach(parse)

        // Once we’ve reached the end, we’ll make sure to
        // capture any name that was previously found.
        if let lastName = partialName, !lastName.isEmpty {
            names.append(lastName)
        }

        return names
    }
}

Note that the above isLetter API on Character was added in Swift 5.

While the above implementation is much more complex than our initial one — it is about twice as fast (on average), which presents us with our first clear trade-off: do we opt for a simpler solution at the potential cost of performance — or do we choose to maintain a more complex, and also more efficient, algorithm?

Detecting multiple tokens

The discussion of which trade-off to accept becomes even more interesting as our list of requirements grows. Let’s say that after successfully rolling out the mentioning feature to our users, we start getting requests to also add support for hashtags, and that we decide to do just that.

Since detecting both usernames and hashtags is exactly the same problem (only the leading character is different — @ vs #), it makes sense to use the same implementation to detect both. To make that happen, let’s first define a Symbol type, which we’ll use to represent either a mention or a hashtag:

struct Symbol {
    enum Kind { case mention, hashtag }

    let kind: Kind
    var string: String
}

While we could’ve used an enum with associated string values instead — since both mentions and hashtags share the same structure, it’ll make our algorithm a bit simpler if we use a struct. However, to enable Symbol to be used in an enum-like fashion, we can also add some static factory methods to make it easy to construct values using dot syntax:

extension Symbol {
    static func mention(_ string: String) -> Symbol {
        return Symbol(kind: .mention, string: string)
    }

    static func hashtag(_ string: String) -> Symbol {
        return Symbol(kind: .hashtag, string: string)
    }
}

With the above in place, let’s now update our mentionedUsernames algorithm from before to detect symbols, rather than just usernames:

extension String {
    var symbols: [Symbol] {
        var partialSymbol: Symbol?
        var symbols = [Symbol]()

        func parse(_ character: Character) {
            if var symbol = partialSymbol {
                guard character.isLetter else {
                    if !symbol.string.isEmpty {
                        symbols.append(symbol)
                    }

                    partialSymbol = nil
                    return parse(character)
                }

                symbol.string.append(character)
                partialSymbol = symbol
            } else {
                // Here’s the only real difference compared to
                // the previous version, since we’ll now decide
                // what kind of symbol we’re parsing based on
                // its leading character:
                switch character {
                case "@":
                    partialSymbol = .mention("")
                case "#":
                    partialSymbol = .hashtag("")
                default:
                    break
                }
            }
        }

        forEach(parse)

        if let lastSymbol = partialSymbol, !lastSymbol.string.isEmpty {
            symbols.append(lastSymbol)
        }

        return symbols
    }
}

With the above change, the difference between our initial string-splitting-based implementation and our latest algorithm grows even larger — since if we were to tokenize both usernames and hashtags by splitting strings, we’d require 6 different iterations (2 times 3) — two of which would be complete passes through the original string.

However, it would be nice if we could find some sort of middle ground between the simplicity of our initial implementation and the power of our manual algorithm. One way to do that could be to introduce an abstraction, that separates the tokenization part of our algorithm from the logic of actually dealing with any tokens found.

To do that, let’s move the bulk of our most recent algorithm to a tokenize method, that accepts a dictionary of handlers, hashed based on which character they’re handling — like this:

extension String {
    func tokenize(using handlers: [Character : (String) -> Void]) {
        // We no longer have to maintain an array of symbols,
        // but we do need to keep track of both any currently
        // parsed symbol, as well as which handler its for.
        var parsingData: (symbol: String, handler: (String) -> Void)?

        func parse(_ character: Character) {
            if var data = parsingData {
                guard character.isLetter else {
                    if !data.symbol.isEmpty {
                        data.handler(data.symbol)
                    }

                    parsingData = nil
                    return parse(character)
                }

                data.symbol.append(character)
                parsingData = data
            } else {
                // If we have a handler for a given character,
                // then we’ll parse it.
                guard let handler = handlers[character] else {
                    return
                }

                parsingData = ("", handler)
            }
        }

        forEach(parse)

        if let lastData = parsingData, !lastData.symbol.isEmpty {
            lastData.handler(lastData.symbol)
        }
    }
}

Not only does the above clean our implementation up a bit, it also gives us much more flexibility — since we can now reuse the same tokenization algorithm within many different contexts, and choose how each token should be handled at the call site.

We’re now able to reduce our symbol parsing code from before to this:

extension String {
    var symbols: [Symbol] {
        var symbols = [Symbol]()

        tokenize(using: [
            "@": { symbols.append(.mention($0)) },
            "#": { symbols.append(.hashtag($0)) }
        ])

        return symbols
    }
}

Very nice and clean! 👍 But, just like before, there’s a set of trade-offs involved. While abstractions are a great way to hide complex logic behind a much nicer API — they’re not free. In fact, our latest version requires about 40% more time to run compared to when the algorithm was inlined into the symbols property. However, it’s still about twice as fast as doing the same thing by splitting strings, so it might actually be that nice middle ground that we were looking for.

Scanning for tokens

So far, we’ve been mostly comparing manually iterating through a string’s characters versus splitting it into components — but, just like with many other things, there are more options to consider.

One such option is using Foundation’s Scanner type to continuously scan the string to identify the tokens we’re looking for. Like with our last implementation from before, it enables us to rely on an abstraction (this time provided by Apple) to handle much of our algorithm’s “heavy lifting”. Here’s what such an implementation could look like:

extension String {
    var symbols: [Symbol] {
        let scanner = Scanner(string: self)
        let symbolSet = CharacterSet(charactersIn: "@#")
        var symbols = [Symbol]()
        var symbolKind: Symbol.Kind?

        // Scanner doesn’t expose a sequence-like API, but rather
        // requires us to inspect its current state to decide
        // when the iteration is over.
        while !scanner.isAtEnd {
            if let kind = symbolKind {
                symbolKind = nil

                // Since Scanner is actually just a Swift interface
                // to the NSScanner Objective-C class, it requires
                // us to pass it an NSString pointer, which it’ll
                // write the result into.
                var result: NSString?
                scanner.scanCharacters(from: .letters, into: &result)

                guard let string = result else {
                    continue
                }

                symbols.append(Symbol(kind: kind, string: string as String))
            } else {
                // Here we scan until we’ve found either an @ or
                // a # character, and then consume that character
                // while assigning a symbol kind to parse.
                scanner.scanUpToCharacters(from: symbolSet, into: nil)

                if scanner.scanString("@", into: nil) {
                    symbolKind = .mention
                } else if scanner.scanString("#", into: nil) {
                    symbolKind = .hashtag
                }
            }
        }

        return symbols
    }
}

While the “Objective-C-ness” of Scanner might make our code look a bit less ”Swifty”, it is a valid option for this kind of problem — and its performance characteristics is on-par with our manual iteration-based implementation from before. It could also be argued that using Scanner produces slightly more readable code, since its APIs explicitly state that we’re scanning for a given string within each part of the algorithm, but that’s going to be highly subjective.

The art of being lazy

Let’s take a look at one final venue for optimization — using lazy collections. Up to this point, all implementations we’ve explored have had one thing in common — they all parse the entire string immediately. While that’s completely fine for use cases that will consume all results immediately as well, we could potentially optimize things further by instead performing our parsing in a more lazy manner.

Like we took a look at in Swift sequences: The art of being lazy, delaying computing an element in a sequence until it’s actually needed can sometimes give us a performance boost — especially if the caller will only consume a subset of our sequence.

Let’s update our last Scanner-based implementation to be lazily executed. To do that, we’ll wrap our algorithm in an AnySequence and AnyIterator, which enables us to form lazy sequences without requiring us to implement a new type from the ground up:

extension String {
    var symbols: AnySequence<Symbol> {
        // Since our character set is a constant, we can compute
        // it immediately, outside of our iteration closure.
        let symbolSet = CharacterSet(charactersIn: "@#")

        // AnySequence enables us to use closures to define both
        // our sequence and its underlying iterator.
        return AnySequence<Symbol> { () -> AnyIterator<Symbol> in
            let scanner = Scanner(string: self)
            var symbolKind: Symbol.Kind?

            // We’ve refactored our algorithm a bit to instead
            // use a nested function, that will be called by the
            // iterator to retrieve the next element (or nil, once
            // we’ve reached the end of the sequence).
            func iterate() -> Symbol? {
                guard !scanner.isAtEnd else {
                    return nil
                }

                guard let kind = symbolKind else {
                    scanner.scanUpToCharacters(from: symbolSet, into: nil)

                    if scanner.scanString("@", into: nil) {
                        symbolKind = .mention
                    } else if scanner.scanString("#", into: nil) {
                        symbolKind = .hashtag
                    }

                    return iterate()
                }

                symbolKind = nil

                var result: NSString?
                scanner.scanCharacters(from: .letters, into: &result)

                guard let string = result else {
                    return iterate()
                }

                return Symbol(kind: kind, string: string as String)
            }

            return AnyIterator(iterate)
        }
    }
}

With the above change, we’ll now get quite a significant performance boost for code like this, which only iterates through our symbols sequence up until a given point:

let firstHashtag = string.symbols.first { $0.kind == .hashtag }

And that presents us with the final trade-off for this article — are we willing to accept an extra bit of complexity (and possibly slightly more difficult debugging), in order to make shorter iterations more performant?

Conclusion

Like when writing any kind of algorithms, exactly what implementation that will be the most appropriate heavily depends on both what sort of problem we’re dealing with, how large the datasets we’re planning to run it on are, and how the result will be consumed.

There are of course also a lot of other ways to parse strings in Swift that this article didn’t cover — such as using regular expressions, or diving down to parse things on the underlying Unicode level — but hopefully it has given you an overview of some of the tools and techniques that are at our disposal, and the various trade-offs that they all come with.

My personal approach is almost always to start with the simplest implementation possible, and while continuously measuring the results and performance characteristics of my algorithms, scale up as needed — just like in this article. We’ll also take a closer look at how exactly to measure our code against various kinds of performance metrics in an upcoming article.

What do you think? Do you have a favorite way of doing string parsing in Swift, or will you try one of the techniques from this article the next time you’re faced with this sort of a problem? Let me know — along with your questions, comments and feedback — on Twitter @johnsundell, or by contacting me.

Thanks for reading! 🚀

Integration tests in Swift