Sometimes major shifts happen virtually unnoticed. On May 5, IBM announced Project CodeNet to very little media or academic attention.
CodeNet is a follow-up to ImageNet, a large-scale dataset of images and their descriptions; the images are free for non-commercial uses. ImageNet is now central to the progress of deep learning computer vision.
CodeNet is an attempt to do for Artificial Intelligence (AI) coding what ImageNet did for computer vision: it is a dataset of over 14 million code samples, covering 50 programming languages, written to solve 4,000 coding problems. The dataset also contains additional metadata, such as the amount of memory required for software to run and log outputs of running code.
Accelerating machine learning
IBM’s own stated rationale for CodeNet is that it is designed to swiftly update legacy systems programmed in outdated code, a development long-awaited since the Y2K panic over 20 years ago, when many believed that undocumented legacy systems could fail with disastrous consequences.
However, as security researchers, we believe the most important implication of CodeNet and similar projects is the potential for lowering barriers to coding, and with it the possibility of Natural Language Coding (NLC).
In recent years, companies such as OpenAI and Google have been rapidly improving Natural Language Processing (NLP) technologies. These are machine learning-driven programs designed to better understand and mimic natural human language and translate between different languages. Training machine learning systems requires access to a large dataset of texts written in the desired human languages. NLC applies all this to coding too.
Coding is a difficult skill to learn, let alone master, and an experienced coder would be expected to be proficient in multiple programming languages. NLC, in contrast, leverages NLP technologies and a vast database such as CodeNet to enable anyone to use English, or ultimately French or Chinese or any other natural language, to code. It could make tasks like designing a website as simple as typing “make a red background with an image of an airplane on it, my company logo in the middle and a contact me button underneath,” and that exact website would spring into existence, the result of automatic translation of natural language to code.
It is clear that IBM was not alone in its thinking. GPT-3, OpenAI’s industry-leading NLP model, has been used to allow coding a website or app by writing a description of what you want. Soon after IBM’s news, Microsoft announced it had secured exclusive rights to GPT-3.
Microsoft also owns GitHub, the largest collection of open source code on the internet, which it acquired in 2018. The company has added to GitHub’s potential with GitHub Copilot, an AI assistant. When the programmer inputs the action they want to code, Copilot generates a code sample that could achieve what they specified. The programmer can then accept the AI-generated sample, edit it or reject it, drastically simplifying the coding