Up(sun) and ready with Pandoc

February 12, 2025 · Florent Huck · Reading time: 6 minutes

With the recent surge of enthusiasm for AI assistants, you may be wondering how these assistants can provide technical information about your product. After the well-known robots.txt, security.txt, and humans.txt of the web 2.0 era, a new standard has been proposed to the web ecosystem and may soon become essential: llms.txt. It was conceived by Jeremy Howard, co-founder of Answer.AI, to address a fundamental challenge in AI-human interaction.

When AI assistants attempt to process standard web pages, they struggle with non-essential elements like navigation menus, scripts, and styling. These elements consume valuable context space without contributing to the actual content understanding. llms.txt provides an elegant solution: it delivers precisely curated information in a format that AI systems can efficiently process and understand.

If you need to convert files from one markup format to another, Pandoc is your Swiss Army knife. Developed by John MacFarlane, Pandoc is a Haskell library for converting between markup formats, and the Pandoc repository also provides a command-line tool built on this library. It is easy to install and ready to convert.

In this how-to guide, we will see how to install the pandoc command-line tool on your Upsun project.

Assumptions:

  • You already have an Upsun account. If you don’t, please register for a trial account. You can sign up with an email address or an existing GitHub, Bitbucket, or Google account. If you choose one of these accounts, you can set a password for your Upsun account later.
  • You have the Upsun CLI installed locally.
  • You have the Git CLI installed locally.

For this tutorial, we will start with a basic HTML application. The main goal is to showcase how to install pandoc on your project and quickly generate an llms.txt file from your HTML pages.

Prepare your local HTML project

In order to quickly showcase the strength of Pandoc, we will simulate a simple HTML application that could be obtained using a static website generator like Hugo. The proposed structure will be:

  • .upsun
    • config.yaml
  • public
    • learn
      • api.html
      • applications.html
    • index.html

To do so, in your terminal, execute the following commands:

    Terminal
    mkdir my-html-app
    cd my-html-app
    mkdir public
    curl -L https://raw.githubusercontent.com/upsun/snippets/refs/heads/main/src/llms/html-app-example.tar.gz | tar -xvzf - -C public
    git init && git add . && git commit -m "init HTML app"
    🚨 Please note: This html-app-example.tar.gz archive contains all the HTML files (index.html, ./learn/*.html) found in the llms folder of the snippets repository.
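
Before going further, you may want to confirm that the archive was extracted where the rest of this guide expects it. The following is a minimal sketch: check_extracted_pages is a hypothetical helper name, and it only checks what the note above states (an index.html at the root and at least one page under learn/).

```shell
# Hypothetical helper: verify that a web root contains index.html
# and at least one HTML page under learn/.
check_extracted_pages() {
  root=$1
  [ -f "$root/index.html" ] || { echo "missing $root/index.html" >&2; return 1; }
  # Reuse the function's positional parameters to hold the glob matches.
  set -- "$root"/learn/*.html
  # If the glob matched nothing, $1 is the literal pattern and -f fails.
  [ -f "$1" ] || { echo "no HTML pages under $root/learn" >&2; return 1; }
  echo "found index.html and $# page(s) under learn/"
}
```

From the my-html-app folder, it could be called as check_extracted_pages public.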

    Give Pandoc a try

    To showcase the power of Pandoc, let’s give it a try locally and convert our HTML to an llms.txt file.

    Install Pandoc locally

    To install Pandoc locally, please follow the official Installation Guide.
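
Once the installation finishes, a quick check like the one below can confirm that the tool is reachable from your shell. This is only a sketch: have_tool is a hypothetical helper name, not something provided by Pandoc or the Upsun CLI.

```shell
# Hypothetical helper: succeed if the given command is available on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

if have_tool pandoc; then
  # Print only the first line of the version banner.
  pandoc --version | head -n 1
else
  echo "pandoc not found on PATH; see the official Installation Guide" >&2
fi
```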

    Use Pandoc for HTML to Markdown conversion

    You should now have access to the pandoc tool. We will use it to generate a public/llms-test.txt file that concatenates all the HTML pages of the project as Markdown. Run the following command, which looks for all HTML files in the public folder and converts them into a single file, ./public/llms-test.txt:

    Terminal
    pandoc $(find ./public -iname "*.html" -type f | sort -d) -f html -s -o "./public/llms-test.txt" -t markdown
    open public/llms-test.txt
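
In the command above, -f html sets the input format, -t markdown sets the output format, -o names the output file, and -s (--standalone) asks for a standalone document rather than a fragment. The pages are concatenated in the order produced by find and sort, so it can help to preview that order before converting. A small sketch, where list_html_pages is a hypothetical helper name:

```shell
# Hypothetical helper: print the HTML pages under a directory in the same
# order the pandoc command receives them (same find and sort pipeline).
list_html_pages() {
  find "$1" -iname "*.html" -type f | sort -d
}
```

Running list_html_pages ./public prints one path per line. Note that the original command relies on shell word splitting, so it assumes paths without spaces.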

    Now that you can see the power of Pandoc, feel free to check the Official Manual for more advanced usage.

    Use Pandoc in your Upsun project

    Generating this llms.txt file locally and committing it to your source code is not convenient. Ideally, the file is regenerated each time you update your website content.

    Init your Upsun config

    The Upsun CLI provides a command to initialize a basic config for your local code. As it is a simple HTML app, we will generate a minimal configuration file using the following command:

    Terminal
    ➜  my-html-app git:(main) upsun project:init
    Welcome to Upsun!
    Let's get started with a few questions.
    
    We need to know a bit more about your project. This will only take a minute!
    
    What language is your project using? We support the following: [JavaScript/Node.js]
    
    Tell us your project's application name: [app]
    
    
                           (\_/)
    We’re almost done...  =(^.^)=
    
    Last but not least, unless you’re creating a static website, your project uses services. Let’s define them:
    
    Select all the services you are using: []
    
    You have not selected any service, would you like to proceed anyway? [Yes]
    
    ┌───────────────────────────────────────────────────┐
    │   CONGRATULATIONS!                                │
    │                                                   │
    │   We have created the following files for your:   │
    │     - .environment                                │
    │     - .upsun/config.yaml                          │
    │                                                   │
    │   We’re jumping for joy! ⍢                        │
    └───────────────────────────────────────────────────┘
             │ /
             │/
      (\ /)
      ( . .)
      o (_(“)(“)

    At the prompts, select:

    • JavaScript/Node.js
    • application name: app
    • no service selected

    Your HTML application is almost ready to be deployed on Upsun; there is one more step to go.

    Update the newly created .upsun/config.yaml file so that the router points to your public folder:

    .upsun/config.yaml
    applications:
      app:
        web:
          locations:
            "/":
              root: "public"
              index: ["index.html"]
              passthru: true

    and then commit your updates:

    Terminal
    git add .upsun/config.yaml && git commit -m "change locations.root to the public folder"

    Create an Upsun project

    You then need to create an Upsun project by executing these commands and following the prompts:

    Terminal
    upsun project:create
    upsun push

    Install Pandoc

    There are two ways to install pandoc on your project:

    Using a shell script

    John MacFarlane provides in his Pandoc repo a quick and easy way to install Pandoc.

    We’ve prepared a shell script for you (source) that can be used to install the latest version of Pandoc. Update your .upsun/config.yaml file and add this curl call in your applications.app.hooks.build step:

    .upsun/config.yaml
    applications:
      app:
        type: "nodejs:20"
        #...
        hooks:
          build: |
            set -x -e
            #...
            curl -fsS https://raw.githubusercontent.com/upsun/snippets/refs/heads/main/src/install-github-asset.sh | bash /dev/stdin "jgm/pandoc" 
            pandoc -v        

    The install-github-asset.sh script, called here with the jgm/pandoc repository, installs the pandoc binary in the /app/.global/bin folder of your application container.

    Using Composable image

    The Upsun Composable image provides enhanced flexibility when defining your app. It allows you to install several runtimes and tools in your application container, in a “one image to rule them all” approach.

    The composable image is built on Nix, and the good news is that a Pandoc package is available.

    Update your .upsun/config.yaml by commenting out the default type parameter and adding the following lines:

    .upsun/config.yaml
    applications:
      app:
        #type: "nodejs:20"
        stack: 
          - pandoc
        #...
        hooks:
          build: |
            set -x -e
            pandoc -v        

    And then, deploy your updates:

    Terminal
    git add .upsun/config.yaml .environment && git commit -m "install Pandoc"
    upsun push

    Use Pandoc dynamically

    You can now use pandoc in your project to dynamically generate a public/llms.txt file that will concatenate all the HTML pages in Markdown, as tested locally before. Update your .upsun/config.yaml by adding the following lines:

    .upsun/config.yaml
    applications:
      app:        
        #...
        hooks:
          build: |
            set -x -e
            #...
            pandoc $(find ./public -iname "*.html" -type f | sort -d) -f html -s -o "./public/llms.txt" -t markdown        
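
If find returns no page, the command above calls pandoc without input files, and pandoc then reads from standard input. Should you prefer the build to fail clearly in that case, the conversion could be wrapped in a small function. This is a sketch under assumptions, not part of the Upsun tooling: generate_llms_txt and the PANDOC_BIN override variable are hypothetical names, and paths are assumed to contain no spaces, as in the original command.

```shell
# Hypothetical wrapper: convert every HTML page under $1 into one
# Markdown file at $2, and fail if no page is found.
generate_llms_txt() {
  src_dir=$1
  out_file=$2
  pages=$(find "$src_dir" -iname "*.html" -type f | sort -d)
  if [ -z "$pages" ]; then
    echo "no HTML pages found under $src_dir" >&2
    return 1
  fi
  # Word splitting on $pages is intentional, one argument per page.
  # PANDOC_BIN is a hypothetical override; it defaults to pandoc.
  "${PANDOC_BIN:-pandoc}" $pages -f html -s -o "$out_file" -t markdown
}
```

In a build hook, it could then be called as generate_llms_txt ./public ./public/llms.txt.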

    And then, deploy your updates:

    Terminal
    git add .upsun/config.yaml && git commit -m "Use Pandoc to generate a public/llms.txt file"
    upsun push

    Check that it works by appending /llms.txt to your environment URL, which you can get with:

    Terminal
    upsun env:url --primary
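
The same check can be done from the terminal. The sketch below makes two assumptions that you should verify on your CLI version with upsun env:url --help: that the --pipe option prints the URL instead of opening a browser, and that curl is installed locally. llms_url and check_llms_txt are hypothetical helper names.

```shell
# Hypothetical helper: build the llms.txt URL from an environment base URL,
# whether or not the base URL ends with a slash.
llms_url() {
  printf '%s/llms.txt\n' "${1%/}"
}

# Hypothetical helper: fetch the generated file and show its first lines.
# Assumes `upsun env:url --primary --pipe` prints only the URL.
check_llms_txt() {
  base_url=$(upsun env:url --primary --pipe) || return 1
  curl -fsS "$(llms_url "$base_url")" | head -n 20
}
```

Calling check_llms_txt from your project folder should display the beginning of the generated Markdown if the deployment succeeded.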

    Conclusion

    Et voilà, we saw how to use pandoc to convert all existing HTML pages into a single Markdown public/llms.txt file. Now, perhaps the next step would be to train an AI assistant with this llms.txt file.

    Stay tuned.

    Discover how to deploy a personal Chainlit AI assistant on Upsun by reading this blog post: Experiment with Chainlit AI interface with RAG on Upsun
