Start of a journey: Building a programming language (Vader)

Yeah… Something’s gotten into me, and, somehow, I’m building a programming language and a compiler. Later on, there will also be an IDE, a VS Code plugin, and frameworks built around it.

This one won’t be long, but it’s the start of the long journey that I’m taking (or at least trying to). I will write down what I have learned and my thoughts along the way. I have decided to call my programming language ‘Vader’ because my next language will be called Kylo and will finish what this one has started.

Overview

A programming language can take a lot of forms. Object-oriented languages and functional languages put people in different programming paradigms. There are also languages compiled to assembly, languages compiled to bytecode, and interpreted languages. However, at the end of the day, source code is one big string with a pattern, so we need a compiler. The compiler needs to read in the string, validate it, turn it into a data structure that a program can work with, and then translate that into code that can be run.

Obviously, compiling code needs to be done in multiple steps. It being complex is not the only reason, and it is not about speed either, since a well-written monolithic program can perform at least as well. The main reason is that one always needs to think ahead. A language is nothing without its ecosystem, and a syntax highlighting extension for the most used code editors is something people take for granted from a new programming language these days. Information from the early steps, where the compiler parses the language, can be used for syntax highlighting and code auto-completion, but that’s for down the line.

Token

Everything can be done with a lot of if-statements, but that’s not why we are here. A token is a more structured way to represent the different kinds of keywords and symbols in the source. Having a structure allows us to put more information in it than just the string itself. There are identifiers that represent variable or function names, value literals such as integers or strings that you normally find on the right side of an assignment, operators like the plus or minus sign, and so much more.

The information a token contains helps the compiler decide what to do with it, so it’s critical to get it right. You need to know the kind of the token. Is it an identifier? Is it an operator? If it’s a value of some sort, you also need to store the value. As long as the token is valid, the information it contains needs to be 100% correct. Otherwise the parser will not know what to do with it.

public SyntaxToken(SyntaxKind kind, int position, string text, object value)
{
    Kind = kind;
    Position = position;
    Text = text;
    Value = value;
}
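
The constructor above refers to a `SyntaxKind` type and four properties that this post does not show. Here is a minimal sketch of how they could fit together, assuming a plain enum whose member names are the ones the lexer further down uses. The `TokenDemo` class and its helpers are made up for illustration and are not part of Vader; they just spell out the tokens you would expect for the input `12 + 3`.

```csharp
using System;
using System.Collections.Generic;

// Assumed enum: the member names are the ones the lexer in this post refers to.
public enum SyntaxKind
{
    NumberToken,
    WhitespaceToken,
    PlusToken,
    MinusToken,
    StarToken,
    SlashToken,
    OpenParenthesisToken,
    CloseParenthesisToken,
    BadToken,
    EndOfFileToken
}

public class SyntaxToken
{
    public SyntaxToken(SyntaxKind kind, int position, string text, object value)
    {
        Kind = kind;
        Position = position;
        Text = text;
        Value = value;
    }

    public SyntaxKind Kind { get; }   // what role the token plays
    public int Position { get; }      // offset of the first character in the source
    public string Text { get; }       // the raw characters
    public object Value { get; }      // parsed value for literals, otherwise null
}

public static class TokenDemo
{
    // Hypothetical helper: the tokens a lexer should hand back for "12 + 3".
    public static List<SyntaxToken> TokensForTwelvePlusThree()
    {
        return new List<SyntaxToken>
        {
            new SyntaxToken(SyntaxKind.NumberToken, 0, "12", 12),
            new SyntaxToken(SyntaxKind.WhitespaceToken, 2, " ", null),
            new SyntaxToken(SyntaxKind.PlusToken, 3, "+", null),
            new SyntaxToken(SyntaxKind.WhitespaceToken, 4, " ", null),
            new SyntaxToken(SyntaxKind.NumberToken, 5, "3", 3),
            new SyntaxToken(SyntaxKind.EndOfFileToken, 6, "\0", null),
        };
    }

    // Hypothetical helper: one readable line per token.
    public static string Describe(SyntaxToken token)
    {
        var value = token.Value == null ? "" : $" value={token.Value}";
        return $"{token.Kind} at {token.Position}: '{token.Text}'{value}";
    }

    public static void Main()
    {
        foreach (var token in TokensForTwelvePlusThree())
            Console.WriteLine(Describe(token));
    }
}
```

Notice that only the number tokens carry a value. For everything else the kind alone tells the compiler what it needs to know.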

Tokens are like the building blocks of a language. In English, they would be subjects, objects or verbs. They are all blocks of letters, but each of them plays a different role. Identifiers express the identities of variables and functions, so they need to be at the start of an expression. Operators connect tokens together, so they need to be between them. These rules are called grammar. Similar to how grammar works in English, they dictate how the tokens are laid out in a language. In other words, don’t mess up the syntax. But how exactly do lines of strings become a collection of tokens? Well, that’s where lexers come in.

Lexer

How lexers got their name is beyond me. It’s probably short for Lex Luthor (okay, it’s really short for “lexical analyzer”). Any programmer who has written a parser before has already written a lexer. It is the piece of code where you write a lot of crazy switch statements and if statements to match a word against what you are expecting to show up at that position. Yep, that’s exactly what a lexer is.

In a compiler, the lexer does exactly that. It reads in the string and, based on the “grammar”, splits it into tokens. If a piece of the string is at a position where the lexer is not expecting it, the compiler reports an error. This is exactly why compiler errors often contain the word “unexpected”: the person who wrote the compiler did not expect such a value to be at such a position.

Think of a lexer as the tokens’ iterator. You need it to get to the next token, and for any other operation on the token stream. It not only turns the string into tokens but also manages them, so the parser that comes later can use it to extract information out of the tokens.

public SyntaxToken NextToken()
{
    if (_position >= _text.Length)
    {
        return new SyntaxToken(SyntaxKind.EndOfFileToken, _position, "\0", null);
    }
    if (char.IsDigit(Current))
    {
        var start = _position;

        while (char.IsDigit(Current))
            Next();
                
        var length = _position - start;
        var text = _text.Substring(start, length);
        if (!int.TryParse(text, out var value))
        {
            // Report the offending number itself, not the whole source text.
            _diagnostics.Add($"Error: The number {text} is not a valid Int32");
        }
        return new SyntaxToken(SyntaxKind.NumberToken, start, text, value);
    }

    if (char.IsWhiteSpace(Current))
    {
        var start = _position;

        while (char.IsWhiteSpace(Current))
            Next();
                
        var length = _position - start;
        var text = _text.Substring(start, length);
        return new SyntaxToken(SyntaxKind.WhitespaceToken, start, text, null);
    }

    if (Current == '+')
        return new SyntaxToken(SyntaxKind.PlusToken, _position++, "+", null);
    else if (Current == '-')
        return new SyntaxToken(SyntaxKind.MinusToken, _position++, "-", null);
    else if (Current == '*')
        return new SyntaxToken(SyntaxKind.StarToken, _position++, "*", null);
    else if (Current == '/')
        return new SyntaxToken(SyntaxKind.SlashToken, _position++, "/", null);
    else if (Current == '(')
        return new SyntaxToken(SyntaxKind.OpenParenthesisToken, _position++, "(", null);
    else if (Current == ')')
        return new SyntaxToken(SyntaxKind.CloseParenthesisToken, _position++, ")", null);
            
    _diagnostics.Add($"Error: Bad char input: '{Current}'");
    return new SyntaxToken(SyntaxKind.BadToken, _position++, _text.Substring(_position - 1, 1), null);
}
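
`NextToken` above leans on a few members this post does not show: the `_text`, `_position` and `_diagnostics` fields, the `Current` property and the `Next()` method. Below is a condensed, runnable sketch of how the whole thing could hang together. The shape of those members is my assumption, in particular that `Current` returns `'\0'` past the end of the input. The `LexerDemo.Tokenize` loop is also made up here to show the "iterator" idea: keep asking for the next token until the end-of-file token shows up.

```csharp
using System;
using System.Collections.Generic;

public enum SyntaxKind
{
    NumberToken, WhitespaceToken, PlusToken, MinusToken, StarToken, SlashToken,
    OpenParenthesisToken, CloseParenthesisToken, BadToken, EndOfFileToken
}

public class SyntaxToken
{
    public SyntaxToken(SyntaxKind kind, int position, string text, object value)
    {
        Kind = kind; Position = position; Text = text; Value = value;
    }

    public SyntaxKind Kind { get; }
    public int Position { get; }
    public string Text { get; }
    public object Value { get; }
}

public class Lexer
{
    private readonly string _text;
    private readonly List<string> _diagnostics = new List<string>();
    private int _position;

    public Lexer(string text) { _text = text; }

    public IReadOnlyList<string> Diagnostics => _diagnostics;

    // Assumption: past the end of the input, Current is '\0', which is neither
    // a digit nor whitespace, so the scanning loops below stop on their own.
    private char Current => _position >= _text.Length ? '\0' : _text[_position];

    private void Next() => _position++;

    public SyntaxToken NextToken()
    {
        var start = _position;
        if (start >= _text.Length)
            return new SyntaxToken(SyntaxKind.EndOfFileToken, start, "\0", null);

        if (char.IsDigit(Current))
        {
            while (char.IsDigit(Current)) Next();
            var text = _text.Substring(start, _position - start);
            if (!int.TryParse(text, out var value))
                _diagnostics.Add($"Error: The number {text} is not a valid Int32");
            return new SyntaxToken(SyntaxKind.NumberToken, start, text, value);
        }

        if (char.IsWhiteSpace(Current))
        {
            while (char.IsWhiteSpace(Current)) Next();
            var text = _text.Substring(start, _position - start);
            return new SyntaxToken(SyntaxKind.WhitespaceToken, start, text, null);
        }

        // Single-character tokens: consume the character, then classify it.
        var c = Current;
        Next();
        switch (c)
        {
            case '+': return new SyntaxToken(SyntaxKind.PlusToken, start, "+", null);
            case '-': return new SyntaxToken(SyntaxKind.MinusToken, start, "-", null);
            case '*': return new SyntaxToken(SyntaxKind.StarToken, start, "*", null);
            case '/': return new SyntaxToken(SyntaxKind.SlashToken, start, "/", null);
            case '(': return new SyntaxToken(SyntaxKind.OpenParenthesisToken, start, "(", null);
            case ')': return new SyntaxToken(SyntaxKind.CloseParenthesisToken, start, ")", null);
            default:
                _diagnostics.Add($"Error: Bad char input: '{c}'");
                return new SyntaxToken(SyntaxKind.BadToken, start, c.ToString(), null);
        }
    }
}

public static class LexerDemo
{
    // Hypothetical driver: pull tokens until the lexer reports the end of the file.
    public static List<SyntaxToken> Tokenize(Lexer lexer)
    {
        var tokens = new List<SyntaxToken>();
        SyntaxToken token;
        do
        {
            token = lexer.NextToken();
            tokens.Add(token);
        } while (token.Kind != SyntaxKind.EndOfFileToken);
        return tokens;
    }

    public static void Main()
    {
        var lexer = new Lexer("12 + x");
        foreach (var token in Tokenize(lexer))
            Console.WriteLine($"{token.Kind} at {token.Position}");
        foreach (var message in lexer.Diagnostics)
            Console.WriteLine(message);
    }
}
```

Capturing `start` once and consuming the character before the switch sidesteps the `_position++` trick, where the token position and the substring both depend on the order the arguments are evaluated in. Either way works in C#, this one is just easier for me to read.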

Expression Tree

As good as tokens are at carrying more information, they are still in the sequential order that the string came in. For the compiler to understand the logic, we need more of a top-down view of the tokens. We need a data structure that can tell us which part is high level and should be looked at first. Naturally, that’s going to be a tree. By defining what the tree looks like, we are essentially defining the syntax of the programming language. I’m still trying to figure this part out, but in the meantime, you can check out my other articles.
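
To make the tree idea a bit more concrete, here is a rough sketch of one common shape for it. This is not Vader’s actual design, which is still up in the air. The names `ExpressionNode`, `NumberNode`, `BinaryNode` and `TreeDemo` are all made up for this example. The tree for `1 + 2 * 3` is built by hand, with no parser involved, just to show that the multiplication ends up deeper in the tree and so gets evaluated first.

```csharp
using System;

public abstract class ExpressionNode { }

// A leaf: a plain number such as 1 or 42.
public sealed class NumberNode : ExpressionNode
{
    public NumberNode(int value) { Value = value; }
    public int Value { get; }
}

// An operator with an expression on each side.
public sealed class BinaryNode : ExpressionNode
{
    public BinaryNode(ExpressionNode left, char op, ExpressionNode right)
    {
        Left = left;
        Operator = op;
        Right = right;
    }

    public ExpressionNode Left { get; }
    public char Operator { get; }
    public ExpressionNode Right { get; }
}

public static class TreeDemo
{
    // Hand-built tree for 1 + 2 * 3: the '+' is the root, the '*' is its right child.
    public static ExpressionNode OnePlusTwoTimesThree()
    {
        return new BinaryNode(
            new NumberNode(1),
            '+',
            new BinaryNode(new NumberNode(2), '*', new NumberNode(3)));
    }

    // Walk the tree bottom-up: children are evaluated before their parent.
    public static int Evaluate(ExpressionNode node)
    {
        if (node is NumberNode number)
            return number.Value;

        if (node is BinaryNode binary)
        {
            var left = Evaluate(binary.Left);
            var right = Evaluate(binary.Right);
            switch (binary.Operator)
            {
                case '+': return left + right;
                case '-': return left - right;
                case '*': return left * right;
                case '/': return left / right;
                default: throw new InvalidOperationException($"Unknown operator '{binary.Operator}'");
            }
        }

        throw new InvalidOperationException("Unknown node type");
    }

    // Render the tree with explicit parentheses so the nesting is visible.
    public static string Show(ExpressionNode node)
    {
        if (node is NumberNode number)
            return number.Value.ToString();

        var binary = (BinaryNode)node;
        return $"({Show(binary.Left)} {binary.Operator} {Show(binary.Right)})";
    }

    public static void Main()
    {
        var tree = OnePlusTwoTimesThree();
        Console.WriteLine(Show(tree));      // (1 + (2 * 3))
        Console.WriteLine(Evaluate(tree));  // 7
    }
}
```

The flat token list for the same input has no such nesting, which is exactly the information the tree adds.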

Follow this series for my journey building a programming language.
