Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Show HN: I'm making a dynamic language in Rust (github.com/mwerezak)
184 points by harpiaharpyja on April 24, 2022 | hide | past | favorite | 45 comments
An implementation of a dynamic programming language in Rust. Includes: Parser/Compiler, REPL, Virtual Machine, Bytecode Disassembler

This started out as a learning project to teach myself Rust. It has grown into a decently substantial piece of software and I've learned quite a bit in the process!

Some neat things:

+ A garbage collector that can store dynamically sized types without any double-indirection (i.e. I have my own Box implementation with manual alloc/dealloc)

+ The smart pointer used to reference GCed data is a thin pointer. The ptr metadata needed for DSTs is stored in the GC allocation itself, so that the GC smart pointer is just a single usize wide. This allows me to keep the core value enum Variant down to 16 bytes (8 bytes for data, the enum discriminant, and some padding).

+ The GC also supports weak references!

+ Statically dispatched type object model using a newtype wrapper and Rust's declarative macros. Ok, what that means is that I have a MetaObject trait that I can use to easily add new data types and define the behavior for specific types. Similar idea to Python's PyTypeObject though very different in implementation. However, I don't resort to dynamic dispatch or trait objects despite working with dynamically type data. Instead, I have a newtype wrapper over the core value enum Variant that statically dispatches to each of the enum branches! And then a few macros that minimize the boilerplate required if I want to add a new branch to Variant or a new method to MetaObject (just a single line in each case).

+ Different string representations! This was inspired by the flexstr crate. Strings that are short enough to fit inside a Variant are "inlined" directly in the value. Longer strings are either GCed or interned in a thread-local string table. All identifiers are interned.

+ An efficient implementation of closures inspired by Lua's upvalues.

The language is still pretty WIP. I'm planning to add an import system, a small standard library, and a few other things

(Yes, the name might not be the best, being also used by a well-known ReST docs generator, I'll take suggestions. I do like the name though, both as a reference to the mythological creature and the cat :D)



Very interesting. I'm curious how you solved the following interaction, which touches on both Rust's borrow checker and a classic footgun for GC'd languages:

You have a GCPtr to some cell on the heap, and you want to dereference the pointer to modify the cell. But the GC also needs to dereference the cell, e.g. to update object references after moving. So while a GCPtr is dereferenced, you must never trigger a collection, which means no allocation. If you do, you violate Rust's "only one &mut" rule, and also risk a dangling pointer. How do you enforce this?

One way you could enforce this is to require a shared reference to the GC to dereference any GCPtr. Instead of `cell.count += 1` you would write `cell.deref(gc).count += 1`. This works but is verbose.

Another way you could enforce it is dynamically, by setting a value in a cell whenever it is dereferenced; but this incurs a runtime cost.

How does Sphinx solve this?


Not OP, but I recently wrote about solving this problem in my interpreter[1]. Essentially using a combination of LCell (compile time interior mutability) and taking a shared reference to the GC as you suggest.

[1] https://coredumped.dev/2022/04/11/implementing-a-safe-garbag...


It works for Sphinx because the scope of the GC is restricted to Sphinx code. The Sphinx language runtime itself is typical Rust code - no GC used. So it is fairly straightforward to ensure that GCPtrs are never dereferenced during a collection cycle. Essentially, collection only ever happens between instructions.

If we were writing a GC for general Rust code then this and other issues around ensuring values are rooted would become much bigger problems. For example, the rust-gc crate has to deal with these. Just to be safe I took a page out of rust-gc's book and implemented a guard to ensure that the runtime will at least panic if I make a mistake and that happens.


> This was inspired by the flexstr crate

I'm honored my string crate was able to inspire. Kinda surprised to see it mentioned as it isn't all that popular (yet? I hope...)


Another programming language that is implemented in Rust is Gleam[1]. I think it's a really slick language, but it doesn't implement GC or it's own VM but is more of a source to source transpiler.

[1]: https://github.com/gleam-lang/gleam


For a longer list of languages written in Rust, see https://github.com/alilleybrinker/langs-in-rust


Rhai and Gluon are the other two big ones. I've used Rhai for a configuration system and it was pretty good. Very very easy to integrate. Unfortunately it is also dynamically typed and doesn't support type annotations yet, but I reckon they'll add that eventually.

Gluon is statically typed but it's also functional and a bit weird.


So Gleam moved to Rust, interesting. I remember it was supposed to be a type safe language for Erlang, because while Elixir exists, it's not type safe, one of the main reasons I didn't start using Elixir, much as I like the Erlang philosophy.


It still targets Erlang—but the compiler's written in Rust (from the first v0.1 release in 2019).


It also targets JavaScript now, and maybe eventually Webassembly.


Tracking issue for native code is here: https://github.com/gleam-lang/gleam/issues/109 unlikely they will directly compile to WASM.

A similar language to gleam built on WASM is grain: https://github.com/grain-lang/grain


Is there a good guide somewhere, that explains all the different "types" of languages? When I read dynamic/static etc, I often have NFI what that means, something better than Wikipedia, perhaps.


There’s two main attributes of languages as far as typing goes:

Static vs dynamic: when a variable is declare does the compiler know and enforce the type? e.g. can I declare a variable as a string and then assign it an int to it? If you can then you have a dynamic language. If you can’t then you have a static language. Examples of static are Java, Rust, and C++. Examples of dynamic are Python, JavaScript, and Ruby.

The other attribute is weak vs strong typing. Weakly typed languages will coerce types where appropriate. for example if I try to compare “1” with 1 a runtime error occur for a strongly typed language, a weakly typed language will coerce “1” -> 1 and return “true”. Examples of strongly typed languages are Ruby and Java. Examples of weakly typed languages are JavaScript.

Mostly statically typed languages are strongly typed.

My definitions probably aren’t rigorous, but I think they’re good enough.


One way to get at this: making your own language will show you not only what each thing means, but also what the underlying mechanics are!

Crafting Interpreters is a great guide on this: http://www.craftinginterpreters.com


> One way to get at this: making your own language will show you not only what each thing means, but also what the underlying mechanics are!

True, but if everyone who asked any question about something had to implement those things to even understand the basics of them, we'd never get anything done :)


For what it's worth, most CS degrees do involve building a programming language.

You don't need a degree to be a developer (I don't have one) but I think it's a good indication that it is actually reasonable to think that every developer could implement a programming language. (And I mean developer as in someone who writes code professionally as their main job, not necessarily sysadmins or designers or whatnot).

To reiterate, I think many more people than think they can can develop their own language.


It's a big topic. Not only static/dynamic, but also things like structural vs nominal typing.

In this case, it comes down to whether the compiler tracks the "type" of every value when compiling (static typing) or whether types are stored somewhere in memory that is read from when your program is actually running.


> something better than Wikipedia, perhaps

You can probably ask a large language model like GPT-N to generate a summary.


You might be interested in these articles if you want to experiment with alternative approaches to the intrusive linked-list and are OK with using object pools. Very handy if you want to have cheap alloc/free and want good memory locality for updating the objects.

https://slideplayer.com/slide/4470263/

https://gamesfromwithin.com/managing-data-relationships


Thanks! I started with an intrusive linked list because I just wanted to get something working but replacing it with something else has been on my mind for the same reasons you mention.


> (Yes, the name might not be the best, being also used by a well-known ReST docs generator, I'll take suggestions. I do like the name though, both as a reference to the mythological creature and the cat :D)

Maybe another cat and mythology related name? If you want to stick with the Egyptian theme, maybe something do to with Bastet: https://en.wikipedia.org/wiki/Bastet


Thanks!


Actual clickable link to Github: https://github.com/mwerezak/sphinx-lang


Did you encounter any difficulties writing this with safe Rust? When or where did you find it necessary to use unsafe code in your implementation?


The only situations that _require_ unsafe are FFI (talking with C libs) and bare-metal access (assembler, embedded, drivers, etc.). An interpreter is totally in the clear on these subjects and should not have anyone reaching for unsafe.

Of course, unsafe _can_ also be used to implement code that takes shortcuts beyond what the borrow checker may be able to handle. But if you're building an interpreter, speed is probably not a primary concern anyway.


> An interpreter is totally in the clear on these subjects and should not have anyone reaching for unsafe.

This is true for a toy interpreter. You don’t “need” unsafe. But anything other then a toy interpreter has many reasons to reach for unsafe: garbage collection, flexstr, object dispatch, tagged pointers, etc. Not to mention the performance gains from using unchecked functions in hot sections of the interpreter loop.


The arbitrary difference between a "toy" and a "tool" is the completeness of the proposed solution. Correctness, ease of use and documentation concerns could come well before execution speed.

I agree that writing an interpreter for an existing language (e.g. Python) one would want to match the performance of existing interpreters and thus would need to use the techniques you mention.


I ask because many implementations of interpreters in Rust that I've seen will reach for unsafe for aspects of the systems that might be inefficient or awkward to implement in safe Rust. I'm curious if or where the author felt the need to do the same or not.


> Statically dispatched type object model using a newtype wrapper and Rust's declarative macros. Ok, what that means is that I have a MetaObject trait that I can use to easily add new data types and define the behavior for specific types.

If I understand correctly, this approach is try and address the expression problem[1]. It makes it easier to define new types for `Variant` (everything is defined in one impl block) at the cost of making it harder to add new `MetaObject` functions (you need to update every applicable impl block).

Also static dispatch seems like the wrong word here, because you are still doing runtime dispatch on `Variant`, even if not using `dyn` Objects. Static dispatch typically refers to monomorphization.

[1]https://craftinginterpreters.com/representing-code.html#the-...


My bad about the incorrect terminology. I figured if it's not dynamic dispatch it must be static but I probably should have done my homework before writing up a description.


Interesting project. Have you considered making an REPL using WASM in browsers as a playground?

I am developing this music live coding language and audio library with Rust and it runs in browsers:

https://glicol.org

I am now using Rhai.rs as the embedding language to write `meta' node in the audio graph. But for audio programming, the running time for each block should not exceed 3ms. In some cases, I found Rhai quite struggling with that. Perhaps dynamic languages have an inherit limitation on that? Wondering how do you see this issue and will performance be part of the future consideration of sphinx-lang?


That would be really cool. I don't know very much about WASM but this seems like a great excuse to learn.


Rust really has a great support for WASM. I made a tiny language as well, and managed to build targeting WASM in a few steps https://github.com/calcit-lang/wasi-calcit/blob/main/.github... .

Tips to notice is WASM runs inside a sandbox, which means threads, random numbers generator(not sure for WASI), FFIs, etc. have to be moved out of the core to prevent them being compiled to WASM, which would probably fail. For major part of the code, they can be compiled to WASM and WASI(WASM System Interface) very easily.


> I'll take suggestions.

Phixns (fixins), "the dyslexic sphinx".


> nonlocal

Oh please, no. This is probably the worst feature of Python.


I'd love to improve this, but without knowing what makes it a misfeature in Python I don't have any guidance.

Maybe the choice of keyword is bad because of the association but it's also not quite the same as in Python. You don't have to "nonlocal x" up front in your function to access a nonlocal variable, which is a pretty huge difference IMO.

My purpose for "nonlocal" was to have a visual highlight of whenever a nonlocal variable is getting modified, because that's the exact point where assignment becomes a side-effect. So you only ever use it when assigning.


Do you have any kind of end goal with the project or are you just winging it and seeing where the language ends up?


Winging it :D


respect!


It is your project but naming conflicts are super frustrating. Please explore the suggestions for something that is a cat and a mythological creature -

- Bastet (Bast) - The beautiful goddess of cats, women's secrets, childbirth, fertility, and protector of the hearth and home from evil or misfortune.

- Mau - The divine cat who, in some stories, is present at the dawn of creation as an aspect of Ra.

https://www.worldhistory.org/article/885/egyptian-gods---the...

https://en.wikipedia.org/wiki/Cultural_depictions_of_cats

https://moderncat.com/articles/cats-mythology/#:~:text=What%....


+1 vote for Mau, since many cats make that sound. (=^・ェ・^=)


Mau reminds me of this markdown-like implementation for ebooks: https://www.thedigitalcatonline.com/blog/2021/02/22/mau-a-li...


What about Harpy? It's another mythological creature and also part of your handle.


Thanks, I'll make it more of a priority to explore an alternate name. Someone else also suggested Bastet so its on my mind.

Mau is cool too but for a lot of people I'd suspect it would be a meaningless word.


Goals are unclear. LuaJit fits definition of current goals. Or just want to [ab]use Rust?

P.S. I know both Lua and Python and you are making syntax intentionally confusing. Do either "fun name() {}" or "def name():". Make it familiar.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: