Code execution

I’ve added a feature to my flashcard site that I’m really excited about. It’s something that I think will be a game-changer for my little flashcard app, especially for those users who, like me, are always trying to level up their programming skills. I’m talking about code execution.

# The Why

When I started building this app, I envisioned it as a tool to help people learn and retain knowledge effectively. Flashcards are great for memorizing facts, but programming is about more than just syntax. It’s about problem-solving, logic, and applying concepts in practice. And yes, some boring repetition is required to make it stick.

When I was learning Python, I used platforms like CodeWars extensively, and enjoyed them a lot. There’s a lot of value in near-instant feedback and a hands-on approach.

# The How

This wasn’t a trivial feature to implement. I had to think about security, performance, user experience, and how it all fits into the existing architecture. After weighing different options, I decided to go with a sandboxed Docker-based approach, orchestrated by a FastAPI backend.

(The other option was to use WASM and run the code in the browser, but that would have been a lot more work due to my lack of WASM experience, and would have been less secure. I will detail this in a future post; I even envision a setup where WASM runs the code in the browser for near-real-time feedback while the server still validates it in a sandboxed environment.)

# Architecture

The basic flow looks like this:

  1. Frontend: The user writes code in a Monaco Editor component (the same editor that powers VS Code, which is just awesome).
  2. API: When the user clicks “Run”, the code is sent to a FastAPI endpoint.
  3. Backend: The backend service receives the request and spins up a new Docker container for the specific language environment.
  4. Sandbox: The code is executed inside the container, isolated from the host system, with strict resource limits (CPU, memory, network access).
  5. Results: The output, along with the results of the test cases, is sent back to the frontend and displayed to the user.

```
+--------------+     +--------------+     +-------------+
|   Frontend   |     |   FastAPI    |     |   Docker    |
| Monaco Editor|---->| + Redis      |---->| Sandbox     |
|              |<----| Cache/Limits |<----| Containers  |
+--------------+     +--------------+     +-------------+
```
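
To make this concrete, here is a minimal sketch of what the execution endpoint might look like. The names (`ExecutionRequest`, `CodeExecutionService`, the module path) are illustrative assumptions, not the actual implementation:

```python
from fastapi import FastAPI
from pydantic import BaseModel

from .services import CodeExecutionService  # hypothetical module path

app = FastAPI()
service = CodeExecutionService()

class ExecutionRequest(BaseModel):
    language: str  # e.g. "python"
    code: str      # the user's solution
    card_id: int   # flashcard whose test cases should run

@app.post("/execute")
def execute(request: ExecutionRequest) -> dict:
    # Delegates to the sandboxed runner sketched under
    # Implementation Details below.
    return service.execute(request.code)
```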

# Security Considerations

Security was a top priority. Running arbitrary user code is a risky business, so I had to take precautions:

  • Sandboxing: Docker containers provide a good level of isolation. Each execution happens in a fresh container that’s destroyed afterward.
  • Resource Limits: I’ve set strict limits on CPU usage, memory, and network access within the containers to prevent abuse and ensure fair usage (see the sketch after this list).
  • Rate Limiting: To prevent denial-of-service attacks and manage load, I implemented rate limiting using Redis: each user has a limited number of executions per minute (also sketched below).
  • Input Sanitization: The backend carefully sanitizes the code input to prevent any attempts to escape the sandbox or execute malicious commands.
  • Network Restrictions: Strict network access restrictions prevent unauthorized external connections.
  • Filesystem Isolation: Complete isolation of the filesystem ensures code can’t access or modify unauthorized files.
  • Import Whitelisting: Only pre-approved imports and libraries are allowed to prevent malicious package usage.
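
The resource limits map directly onto flags of the Docker API. Here is roughly how they could look with the Docker SDK for Python; the exact values are illustrative, not my production settings:

```python
import docker

client = docker.from_env()

def run_sandboxed(code: str) -> str:
    """Run user code in a locked-down, throwaway container."""
    output = client.containers.run(
        image="python:3.9-slim",
        command=["python", "-c", code],
        network_disabled=True,   # no outbound connections
        mem_limit="128m",        # hard memory cap
        nano_cpus=500_000_000,   # half a CPU core
        pids_limit=64,           # no fork bombs
        read_only=True,          # immutable filesystem
        remove=True,             # container is destroyed afterward
    )
    return output.decode()
```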
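
The Redis-based rate limiting can be as simple as an expiring counter per user. A minimal sketch; the key naming and the limit are made up for illustration:

```python
import redis

r = redis.Redis()
MAX_RUNS_PER_MINUTE = 10  # illustrative limit

def allow_execution(user_id: str) -> bool:
    """Fixed one-minute window: the first request starts the clock."""
    key = f"code-exec:{user_id}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 60)  # window resets a minute after the first run
    return count <= MAX_RUNS_PER_MINUTE
```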

# Implementation Details

The core of the backend logic is in the CodeExecutionService. It’s responsible for creating the Docker containers, executing the code, and managing the lifecycle of the execution environment. It currently only supports Python, but I think it won’t be a huge effort to support other languages.
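
As a sketch of what that lifecycle management might look like (again using the Docker SDK for Python; this class is a simplification of the real service, not its actual code):

```python
import docker

class CodeExecutionService:
    """Create a container, run the code with a timeout, always clean up."""

    def __init__(self) -> None:
        self.client = docker.from_env()

    def execute(self, code: str, timeout: int = 5) -> dict:
        container = self.client.containers.run(
            image="python:3.9-slim",
            command=["python", "-c", code],
            network_disabled=True,
            mem_limit="128m",
            detach=True,  # return immediately; we wait with a timeout below
        )
        try:
            result = container.wait(timeout=timeout)  # raises if it hangs
            return {
                "exit_code": result["StatusCode"],
                "output": container.logs().decode(),
            }
        finally:
            container.remove(force=True)  # fresh container per execution
```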

On the frontend, the CodeEditor component provides a user-friendly interface for writing and running code. It features syntax highlighting, basic autocompletion, and a clean layout. I also added a section to display the test results, with clear indications of passed and failed cases.

# Test Cases and Auditing

To ensure the quality and correctness of the coding flashcards, I implemented an auditing script. This script runs the provided answer code against the defined test cases and updates the flashcard metadata with the results. It’s a crucial part of the content pipeline, ensuring that only high-quality, working code examples make it into the app.

The test cases come in two varieties:

  • Visible test cases that help users understand what’s expected
  • Hidden test cases that prevent solution hardcoding and ensure thorough understanding

When tests fail, the system provides detailed comparisons between expected and actual outputs, similar to modern testing frameworks, making debugging straightforward and educational.
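
Under the hood, running the cases boils down to comparing expected and actual values, and keeping the detailed diff only for visible cases. A simplified sketch of that shape (the real reporting is richer):

```python
def run_test_cases(solution, test_cases: list[dict]) -> list[dict]:
    """Run a solution callable against visible and hidden test cases."""
    results = []
    for case in test_cases:
        actual = solution(*case["inputs"])
        results.append({
            "passed": actual == case["expected"],
            "hidden": case.get("hidden", False),
            # Expected-vs-actual detail is only exposed for visible cases,
            # so hidden cases can't be reverse-engineered from the output.
            "detail": None if case.get("hidden")
                      else {"expected": case["expected"], "actual": actual},
        })
    return results
```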

# Technical Deep Dive

## Test Case Handling

One of the more interesting technical challenges was handling test cases effectively. The system supports both simple and complex inputs through a parsing system, which allows users to test functions with various input types, from simple integers to complex nested data structures.
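
The trick that keeps this manageable is that most test inputs can be written as Python literals and parsed safely with `ast.literal_eval`, which evaluates literals without executing arbitrary code. A minimal sketch of the idea (my actual parser handles a few more formats):

```python
import ast

def parse_test_input(raw: str) -> tuple:
    """Turn a comma-separated argument string into Python values."""
    if not raw.strip():
        return ()
    # Wrapping in parentheses with a trailing comma makes single
    # arguments and nested structures parse the same way.
    return ast.literal_eval(f"({raw},)")

parse_test_input("3, 'abc', [1, {'k': (4, 5)}]")
# -> (3, 'abc', [1, {'k': (4, 5)}])
```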

## Automated Quality Assurance

I’ve implemented an automated audit system that continuously validates all coding flashcards. What makes this interesting is its ability to:

  • Automatically detect and validate coding questions
  • Execute test cases in a safe environment
  • Generate detailed test reports
  • Attempt to fix failing cards using AI assistance
  • Maintain an audit trail of all validations

## Error Recovery and AI Assistance

When a coding card fails validation, the system doesn’t just give up. Instead, it employs a multi-stage recovery process:

  1. First, it audits the test cases themselves to ensure they look valid - do they expect the right output?
  2. If the test cases are good but the code fails, it attempts to fix the code
  3. All fixes are reviewed and stored as suggestions rather than automatic updates
  4. Detailed error reports help identify common failure patterns

This approach ensures that the content quality remains high while providing valuable insights into where users might struggle.
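
In sketch form, the recovery pipeline looks roughly like this. The audit and fix steps are passed in as callables here; the real implementation wires in the actual audit script and AI client:

```python
from dataclasses import dataclass, field

@dataclass
class Card:
    answer_code: str
    test_cases: list
    suggestions: list = field(default_factory=list)

def recover_failing_card(card: Card, audit_tests, propose_fix) -> None:
    """Multi-stage recovery: audit the tests first, then suggest a code fix."""
    # Stage 1: are the test cases themselves valid (right expected outputs)?
    if not audit_tests(card.test_cases):
        card.suggestions.append(("test_cases", "flagged for manual review"))
        return
    # Stage 2: the tests look good, so try to fix the answer code.
    fixed = propose_fix(card.answer_code, card.test_cases)
    # Stage 3: never auto-apply; store the fix as a reviewable suggestion.
    card.suggestions.append(("code", fixed))
```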

# Performance Deep Dive

One of the most exciting aspects of this feature is its performance: the entire round trip from clicking “Run” to seeing results typically takes less than 700 ms. Here’s how I achieved this:

## Optimized Docker Setup

I use a highly optimized Docker image for code execution:

```dockerfile
FROM python:3.9-slim

# performance optimizations
ENV PYTHONUNBUFFERED=1
ENV PIP_DISABLE_PIP_VERSION_CHECK=1
ENV PIP_NO_CACHE_DIR=1

# run code as an unprivileged user
RUN useradd --create-home sandbox
USER sandbox
```

This minimal setup provides several advantages:

  • The slim Python image is only ~120 MB (compared to ~900 MB for the full image)
  • No additional packages or dependencies are installed
  • Environment variables are optimized for fast execution
  • Running as a non-root user improves security

## Docker’s Built-in Optimizations

While the code appears to create and destroy containers for each execution, it benefits from Docker’s internal optimizations:

  1. Container layering: Docker maintains a cache of image layers, making subsequent container creations nearly instant
  2. Copy-on-write filesystem: New containers start immediately using Docker’s copy-on-write mechanism
  3. Page cache: Recently used image layers stay in the kernel’s page cache, which further speeds up container startup

# Future Improvements

This is just the first iteration of the code execution feature. I have many ideas for future improvements:

  • More Languages: Currently only Python is supported, but I plan to add more languages based on user demand.
  • Advanced Test Cases: I want to support more complex test scenarios, including edge cases, performance tests, and even randomized inputs.
  • User-Defined Test Cases: Allowing users to create and share their own test cases could be a powerful way to crowdsource quality content.
  • Interactive Console: Adding a REPL-like console for interactive experimentation would enhance the learning experience.
  • Performance: I want to keep the execution time as low as possible to provide a smooth user experience. I’ve already optimized the Docker images and implemented caching for common operations; next, I want to explore resource pooling to cut container startup time even further.

# Conclusion

I believe this feature will significantly enhance the value of my flashcard app for programming learners. It’s one thing to read about a concept, but it’s another to immediately apply it and see the results. I’m excited to see how users will interact with this feature and how it will help them on their learning journey.

As always, I’m open to feedback and suggestions. This app is a labor of love, and I’m constantly striving to improve it. So, if you have any ideas or thoughts, please don’t hesitate to reach out!

Written on January 16, 2025

If you notice anything wrong with this post (factual error, rude tone, bad grammar, typo, etc.), and you feel like giving feedback, please do so by contacting me at hello@samu.space. Thank you!