AI-Based Prose Programming for Subject Matter Experts: Will This Work?

August 25, 2023 Audrey

Table of Contents

Key Takeaways

Recent advances in prose-to-code generation via Large Language Models (LLMs) will make it practical for non-programmers to “program in prose” for practically useful program complexities, a long-standing dream of computer scientists and subject-matter experts alike.

Assuming that correctness of the code and explainability of the results remain important, testing the code will still have to be done using more traditional approaches. Hence, the non-programmers must understand the notion of testing and coverage.

Program understanding, visualization, exploration, and simulation will become even more relevant in the future to illustrate what the generated program does to subject matter experts.

There is a strong synergy with very high-level programming languages and domain-specific languages (DSLs) because the to-be-generated programs are shorter (and less error prone) and more directly aligned with the execution semantics (and therefore easier to understand).

I think it is still an open question how far the approach scales and how integrated tools will look that exploit both LLMs’ “prose magic” and more traditional ways of computing. I illustrate this with an open-source demonstrator implemented in JetBrains MPS.

Introduction

As a consequence of AI, machine learning, neural networks, and in particular Large Language Models (LLMs) like ChatGPT, there’s a discussion about the future of programming. There are mainly two areas. One focuses on how AI can help developers code more efficiently. We have probably all asked ChatGPT to generate small-ish fragments of code from prose descriptions and pasted them into whatever larger program we were developing. Or used Github Copilot directly in our IDEs.

This works quite well because, as programmers, we can verify that the code makes sense just by looking at it or trying it out in a “safe” environment. Eventually (or even in advance), we write tests to validate that the generated code works in all relevant scenarios. And the AI-generated code doesn’t even have to be completely correct because it is useful to developers if it reaches 80% correctness. Just like when we look up things on Stackoverflow, it can serve as an inspiration/outline/guidance/hint to allow the programmer to finish the job manually. I think it is indisputable that this use of AI provides value to developers.

The second discussion area is whether this will enable non-programmers to instruct computers. The idea is that they just write a prompt, and the AI generates code that makes the machine do whatever they intended. The key difference to the previous scenario is that the inherent safeguards against generated nonsense aren’t there, at least not obviously.

A non-programmer user can’t necessarily look at the code and check it for plausibility, they can’t necessarily bring a generated 80% solution to 100%, and they don’t necessarily write tests. So will this approach work, and how must languages and tools change to make it work? This is the focus of this article.

Why not use AI directly?

You might ask: why generate programs in the first place? Why don’t we just use a general-purpose AI to “do the thing” instead of generating code that then “does the thing”? Let’s say we are working in the context of tax calculation. Our ultimate goal is a system that calculates the tax burden for any particular citizen based on various data about their incomes, expenses, and life circumstances.

We could use an approach where a citizen enters their data into some kind of a form and then submits the form data (say, as JSON) to an AI (either a generic LLM or tax-calculation-specific model), which then directly computes the taxes. There’s no program in between, AI-generated or otherwise (except the one that collects the data, formats the JSON, and submits it to the AI). This approach is unlikely to be good enough in most cases for the following reasons:

AI-based software isn’t good at mathematical calculations ^[1]; this isn’t a tax-specific issue since most real-world domains contain numeric calculations.

If an AI is only 99% correct, the 1% wrong is often a showstopper.

Whatever the result is, it can’t be explained or “justified” to the end user (I will get back to this topic below).

Running a computation for which a deterministic algorithm exists with a neural network is inefficient in terms of computing power and the resulting energy and water consumption.

If there’s a change to the algorithm, we have to retrain the network, which is even more computationally expensive.

To remedy these issues, we use an approach where a subject matter expert who is not a programmer, say our tax consultant, describes the logic of the tax calculation to the AI, and the AI generates a classical, deterministic algorithm which we then repeatedly run on citizens’ data. Assuming the generated program is correct, all the above drawbacks are gone:

It calculates the result with the required numeric precision.

By tracing the calculation algorithm, we can explain and justify the result (again, I will explain this in more detail below).

It will be correct in 100% of the cases (assuming the generated program is correct).

The computation is as energy efficient as any program today.

The generated code can be adapted incrementally as requirements evolve.

Note that we assume correctness for all (relevant) cases and explainability are important here. If you don’t agree with these premises, then you can probably stop reading; you are likely of the opinion that AI will replace more or less all traditionally programmed software. I decidedly don’t share this opinion, at least not for 5–7 years.

Correctness and Creativity

Based on our experience with LLMs writing essays or Midjourney & Co generating images, we ascribe creativity to these AIs. Without defining precisely what “creativity” means, I see it here as a degree of variability in the results generated for the same or slightly different prompts. This is a result of how word prediction works and the fact that these tools employ randomness in the result generation process (Stephen Wolfram explains this quite well in his essay). This feels almost like making a virtue from the fault that neural networks generally aren’t precisely deterministic.

Just do an experiment and ask an image-generating AI to render technical subjects such as airplanes or cranes, subjects for which a specific notion of “correct” exists; jet airliners just don’t have two wings of different lengths or two engines on one wing and one on the other. The results are generally disappointing. If, instead, you try to generate “phantasy dogs running in the forest while it rains,” the imprecision and variability are much more tolerable, to the point we interpret it as “creativity.” Generating programs is more like rendering images of airplanes than of running dogs. Creativity is not a feature for this use case of AI.

Explainability

Let me briefly linger on the notion of explainability. Consider again your tax calculation. Let’s say it asks you to pay 15.323 EUR for a particular period of time. Based on your own estimation, this seems too much, so you ask, “Why is it 15.323 EUR?” If an AI produces the result directly, it can’t answer this question. It might (figuratively) reply with the value for each of the internal neurons’ weights, thresholds, and activation levels of the internal neurons. Still, those have absolutely no meaning to you as a human. Their connection to the logic of tax calculation is, at best, very indirect. Maybe it can even (figuratively) show you that your case looks very similar to these 250 others, and therefore, somehow, your tax amount has to be 15.323 EUR. A trained neural network is essentially just an extremely tight curve fit,one with a huge number of parameters. It’s a form of “empirical programming”: it brute-force replicates existing data and extrapolates.

It’s just like in science: to explain what fitted data means, you have to connect it to a scientific theory, i.e., “fit the curve with physical quantities that we know about. The equivalent of a scientific theory (stretching the analogy a bit) is a “traditional” program that computes the result based on a “meaningful” algorithm. The user can inspect the intermediate values, see the branches the program took, the criteria for decisions, and so on. This serves as a reasonable first-order answer to the “why” question – especially if the program is expressed with abstractions, structures, and names that make sense in the context of the tax domain ^[2].

A well-structured program can also be easily traced back to the law or regulations that back up the particular program code. Program state, expressed with reasonably domain-aligned abstractions, plus a connection to the “requirements” (the law in case of tax calculation) is a really good answer to the “why.” Even though there is research into explainable AI, I don’t think the current approach of deep learning will be able to do this anytime soon. And the explanations that ChatGPT provides are often hollow or superficial. Try to ask “why” one or two more times, and you’ll quickly see that it can’t really explain a lot.

Domain-Specific Tools and Languages

A part of the answer of whether subject-matter expert prose programming works is domain-specific languages (DSL). A DSL is a (software) language that is tailor-made for a particular problem set – for example, for describing tax calculations and the data structures necessary for them, or for defining questionnaires used in healthcare to diagnose conditions like insomnia or drug abuse. DSLs are developed with the SMEs in the field and rely on abstractions and notations familiar to them. Consequently, if the AI generates DSL code, subject matter experts will be more able to read the code and validate “by looking” that it is correct.

There’s an important comment I must make here about the syntax. As we know, LLMs work with text, so we have to use a textual syntax for the DSL when we interact with the LLM. However, this does not mean that the SME has to look at this for validation and other purposes. The user-facing syntax can be a mix of whatever makes sense: graphical, tables, symbolic, Blockly-style, or textual. While representing classical programming languages graphically often doesn’t work well, it works much better if the language has been designed from the get-go with the two syntaxes in mind – the DSL community has lots of experience with this.

More generally, if the code is written by the AI and only reviewed or adapted slightly by humans, then the age-old trade-off between writability and readability is decided in favor of readability. I think the tradeoff has always tended in this direction because code is read much more often than it is written, plus IDEs have become more and more helpful with the writing part. Nonetheless, if the AI writes the code, then the debate is over.

A second advantage to generating code in very high-level languages such as DSLs is that it is easier for the AI to get it right. Remember that LLMs are Word Prediction Machines. We can reduce the risk of predicting wrong by limiting the vocabulary and simplifying the grammar. There will be less non-essential variability in the sentences, so there will be a higher likelihood of correctly generated code. We should ensure that the programming language is good at separating concerns. No “technical stuff” mixed with the business logic the SME cares about.

The first gateway for correctness is the compiler (or syntax/type checker in case of an interpreted language). Any generated program that does not type check or compile can be rejected immediately, and the AI can automatically generate another one. Here is another advantage of high-level languages: you can more easily build type systems that, together with the syntactic structure, constrain programs to be meaningful in the domain. In the same spirit, the fewer (unnecessary) degrees of freedom a language has, the easier it is to analyze the programs relative to interesting properties. For example, a state machine model is easier to model check than a C program. It is also easier to extract an “explanation” for the result, and, in the end, it is easier for an SME to learn to validate the program by reading it or running it with some kind of simulator or debugger. There’s just less clutter, which simplifies everybody’s (and every tool’s) life.

There are several examples that use this approach. Chat Notebooks in Mathematica allow users to write prose, and ChatGPT generates the corresponding Wolfram Language code that can then be executed in Mathematica. A similar approach has been demonstrated for Apache Spark and itemis CREATE, a state machine modeling tool (the linked article is in German, but the embedded video is in English). I will discuss my demonstrator a bit more in the next section.

The approach of generating DSL code also has a drawback: the internet isn’t full of example code expressed in your specific language for the LLM to learn from. However, it turns out that “teaching” ChatGPT the language works quite well. I figure there are two reasons: one is that even though the language is domain-specific, many parts of it, for example, expressions, are usually very similar to traditional programming languages. And second, because DSLs are higher-level relative to the domain, the syntax is usually a bit more “prose-like”; so expressing something “in the style of the DSL I explained earlier” is not a particular challenge for an AI.

The size of the language you can teach to an LLM is limited by the “working memory” of the LLM, but it is fair to assume that this will grow in the future, allowing more sophisticated DSLs. And I am sure that other models will be developed that are optimized for structured text, following a formal schema rather than the structure of (English) prose.

A demonstrator

I have implemented a system that demonstrates the approach of combining DSLs and LLMs. The demonstrator is based on JetBrains’ MPS and ChatGPT; the code is available on github. The example language focuses on forms with fields and calculated values; more sophisticated versions of such forms are used, for example, as assessments in healthcare. Here is an example form:

In addition to the forms, the language also supports expressing tests; these can be executed via an interpreter directly in MPS.

In this video, I show how ChatGPT 3.5 turbo generates meaningfully interesting forms for prose prompts. Admittedly, this is a simple language, and the DSLs we use for real-world systems are more complex. I have also done other experiments where the language was more complicated and it worked reasonably well. And as I have said, LLMs will become better and more optimized for this task. In addition, most DSLs have different aspects or viewpoints, and a user often just has to generate small parts of the model that, from the perspective of the LLM, can be seen as smaller languages.

A brief description of how this demonstrator is implemented technically can be found in the README on github.

Understanding and testing the generated code

Understanding what a piece of code does just by reading it only goes so far. A better way to understand code is to run it and observe the behavior. In the case of our tax calculation example, we might check that the amount of tax our citizen has to pay is correct relative to what the regulations specify. Or we validate that the calculated values in the healthcare forms above have the expected values. For realistically complex programs, there is a lot of variability in the behavior; there are many case distinctions (tax calculations are a striking example, and so are algorithms in healthcare), so we write tests to validate all relevant cases.

This doesn’t go away just because the code is AI-generated. It is even more critical because if we don’t express the prose requirements precisely, the generated code is likely to be incorrect or incomplete – even if we assume the AI doesn’t hallucinate nonsense. Suppose we use a dialog-based approach to get the code right incrementally. In that case, we need regression testing to ensure previously working behavior isn’t destroyed by an AI’s “improvement” of the code. So all of this leads to the conclusion that if we let the AI generate programs, the non-programmer subject matter expert must be in control of a regression test suite – one that has reasonably good coverage.

I don’t think that it is efficient – even for SMEs – to make every small change to the code through a prose instruction to the AI. Over time they will get a feel for the language and make code changes directly. The demo application I described above allows users to modify the generated form, and when they then instruct the LLM to modify further, the LLM continues from the result of the user’s modified state. Users and the LLM can truly collaborate. The tooling also supports “undo”: if the AI changes the code in a way that does more harm than good, you want to be able to roll back. The demonstrator I have built keeps the history of prompt-reply-pairs as a list of nodes in MPS; stepwise undo is supported just by deleting the tail of the list.

So how can SMEs get to the required tests? If the test language is simple enough (which is often the case for DSLs, based on my experience), they can manually write the tests. This is the case in my demonstrator system, where tests are just a list of field values and calculation assertions. It’s inefficient to have an LLM generate the tests based on a much more verbose prose description. This is especially true with good tool support where, as in the demonstrator system, the list of fields and calculations is already pre-populated in the test. An alternative to writing the tests is to record them while the user “plays” with the generated artifact. While I have not implemented this for the demonstrator, I have done it for a similar health assessment DSL in a real project: the user can step through a fully rendered form, enter values, and express “ok” or “not ok” on displayed calculated values.

Note that users still have to think about relevant test scenarios, and they still have to continue creating tests until a suitable coverage metric shows green. A third option is to use existing test case generation tools. Based on analysis of the program, they can come up with a range of tests that achieve good coverage. The user will usually still have to manually provide the expected output values (or, more generally, assert the behavior) for each automatically generated set of inputs. For some systems, such test case generators can generate the correct assertion as well, but then the SME user at least has to review them thoroughly – because they will be wrong if the generated program is wrong. Technically speaking, test case generation can only verify a program, not validate it.

Mutation testing (where a program is automatically modified to identify parts that don’t affect test outcomes) is a good way of identifying holes in the coverage; the nice thing about this approach is that it does not rely on fancy program analysis; it’s easy to implement, also for your own (domain-specific) languages. In fact, the MPS infrastructure on which we have built our demonstrator DSL supports both coverage analysis (based on the interpreter that runs the tests), and we also have a prototype program mutator.

We can also consider having the tests generated by an AI. Of course, this carries the risk of self-fulfilling prophecies; if the AI “misunderstands” the prose specification, it might generate a wrong program and tests that falsely corroborate that wrong program. To remedy this issue, we can have the program and tests generated by different AIs. At the very least, you should use two separate ChatGPT sessions. In my experiments, the ChatGPT couldn’t generate the correct expected values for the form calculations; it couldn’t “execute” the expressions it generated into the form earlier. Instead of generating tests, we can generate properties ^[3] for verification tools, such as model checkers. In contrast to generated tests, generated properties provide a higher degree of confidence. Here’s the important thing: even if tests or properties are generated (based on traditional test generators or via AI), then at least the tests have to be validated by a human. Succeeding tests or tool-based program verifications are only useful if they ensure the right thing.

There’s also a question about debugging. What happens if the generated code doesn’t work for particular cases? Just writing prompts à la “the code doesn’t work in this case, fix it!” is inefficient; experiments with my demonstrator confirm this suspicion. It will eventually become more efficient to adapt the generated code directly. So again: the code has to be understood and “debugged.” A nicely domain-aligned language (together with simulators, debuggers, and other means of relating the program source to its behavior) can go a long way, even for SMEs. The field of program tracing, execution visualization, live programming, and integrated programming environments where there’s less distinction between the program and its executions is very relevant here. I think much more research and development are needed for programs without obvious graphical representations; the proverbial bouncing ball from the original Live Programming demo comes to mind.

There’s also another problem I call “broken magic.” If SMEs are used to things “just working” based on their prose AI prompt, and they are essentially shielded from the source code and, more generally, how the generated program works, then it will be tough for them to dig into that code to fix something. The more “magic” you put into the source-to-behavior path, the harder it is for (any kind of) user to go from behavior back to the program during debugging. You need quite fancy debuggers, which can be expensive to build. This is another lesson learned from years and years of using DSLs without AI.

Summing up

Let’s revisit the skills the SMEs will need in order to reliably use AI to “program” in the context of a particular domain. In addition to being able to write prompts, they will have to learn how to review, write or record tests, and understand coverage to appreciate which tests are missing and when enough tests are available. They have to understand the “paradigm” and structure of the generated code so they can make sense of explanations and make incremental changes. For this to work in practice, we software engineers have to adapt the languages and tools we use as the target of AI code generation:

Smaller and more domain-aligned languages have a higher likelihood that the generated code will be correct and are easier for SMEs to understand; this includes the language for writing tests.

We need program visualizers, animators, simulators, debuggers, and other tools that reduce the gap between a program and its set of executions.

Finally, any means of test case generation, program analysis, and the like will be extremely useful.

So, the promise that AI will let humans communicate with computers using the humans’ language is realistic to a degree. While we can express the expected behavior as prose, humans have to be able to validate that the AI-generated programs are correct in all relevant cases. I don’t think that doing this just via a prose interface will work well; some degree of education on “how to talk to computers” will still be needed, and the diagnosis that this kind of education is severely lacking in most fields except computer science remains true even with the advent of AI.

Of course, things will change as AI improves – especially in the case of groundbreaking new ideas where classical, rule-based AI is meaningfully integrated with LLMs. Maybe more or less manual validation is no longer necessary because the AI is somehow good enough to always generate the correct programs. I don’t think this will happen in the next 5–7 years. Predicting beyond is difficult – so I don’t.

Footnotes

[1] In the future, LLMs will likely be integrated with arithmetic engines like Mathematica, so this particular problem might go away.

[2] Imagine the same calculation expressed as a C program with a set of global integer variables all names i1 through i500. Even though the program can absolutely produce the correct results and is fully deterministic, inspecting the program’s execution – or some kind of report auto-generated from it – won’t explain anything to a human. Abstractions and names matter a lot!

[3] Properties are generalized statements about the behavior of a system that verification tools try to prove or try to find counterexamples for.

Acknowledgments

Thanks to Sruthi Radhakrishnan, Dennis Albrecht, Torsten Görg, Meite Boersma, and Eugen Schindler for feedback on previous versions of this article.

I also want to thank Srini Penchikala and Maureen Spencer for reviewing and copyediting this article.