How we crafted a domain-specific language for JSON transformation at RudderStack

RudderStack created a JSON Template Engine to simplify transformation of JSON data from one format to another, making it easier to manage and maintain complex integrations. This blog post will cover why we needed to craft our own Domain-Specific Language for JSON transformation and how we did it.

First, some background on the problem we were trying to solve and why we needed to create our own JSON Template Engine.

The challenge

RudderStack is the Warehouse Native CDP. We provide an integrated solution for data collection, unification in the warehouse, and activation. Our platform supports over 200 integrations and features a powerful Transformations tool. Traditionally, we used native JavaScript code for data transformation, which required significant effort and maintenance. Writing intricate JavaScript code for complex JSON transformations can be error-prone and time-consuming. Moreover, JavaScript’s general-purpose nature did not provide the level of abstraction and expressiveness needed to succinctly represent JSON transformation logic. Although JSONata offered a more efficient way to manipulate JSON data, we still encountered performance bottlenecks due to its parsing and interpretation overhead.

Our solution

The solution was to use a domain-specific language tailored specifically for JSON transformation. By designing a custom JSON template language, we can provide developers with a specialized syntax and semantics optimized for JSON manipulation tasks. Such a language would abstract away low-level JavaScript details, simplify complex transformation logic, and enhance readability and maintainability.

With that goal in mind, we developed our own JSON Transformation Engine. This engine generates optimized JavaScript code from transformation templates, reducing runtime overhead and significantly improving performance.
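
The difference between interpreting a template on every call and generating code once can be sketched as follows. This is a minimal illustration rather than the engine's code: `interpretPath` and `compilePath` are hypothetical names, and the template syntax is reduced to simple dot paths such as `.user.address.city`.

```javascript
// Hypothetical sketch: interpretation vs. code generation for simple dot paths.
// Only paths like ".user.address.city" are supported here.

// Interpreter: re-parses the path and walks the data on every call.
function interpretPath(path, data) {
  const keys = path.split('.').filter(Boolean);
  let current = data;
  for (const key of keys) {
    if (current === null || current === undefined) return undefined;
    current = current[key];
  }
  return current;
}

// Compiler: parses the path once and emits JavaScript source,
// which is then turned into a reusable function.
function compilePath(path) {
  const keys = path.split('.').filter(Boolean);
  const body = 'input' + keys.map((key) => `?.[${JSON.stringify(key)}]`).join('');
  return new Function('input', `return ${body};`);
}

const event = { user: { address: { city: 'Pune' } } };
const getCity = compilePath('.user.address.city');

console.log(interpretPath('.user.address.city', event)); // Pune
console.log(getCity(event)); // Pune
console.log(getCity({})); // undefined
```

The compiled function pays the parsing cost once, so repeated calls only run the emitted property accesses.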

Steps to build a domain-specific language

Here’s how we crafted our custom JSON template language. You can follow a similar process to create a language for your problem domain.

1. Define the domain and requirements

Start by clearly defining the domain for which you’re building the DSL — in our case, JSON transformation. Identify the specific requirements and challenges within that domain, such as the need for concise syntax, support for complex data structures, and efficient execution.

2. Design language syntax and semantics

Based on the identified requirements, design the syntax and semantics of your DSL — in our case, the JSON template language. Define language constructs such as statements, expressions, and control flow mechanisms that enable users to express JSON transformation logic in a clear and concise manner.

3. Implement lexing (tokenization)

Lexical analysis involves breaking down the source code into tokens, the smallest units of meaningful characters in the language. Implement a lexer to tokenize the input JSON template code, identifying keywords, identifiers, operators, and other lexical elements.

To see how we approach tokenization, let’s look at how the descendant operator `..` is implemented. This operator searches for a specific key in all descendants of a property.

To begin, we must first locate the descendant operator within the code. This can be achieved by creating a generic function as part of the Lexer, which is responsible for identifying various punctuators that include dots.

JAVASCRIPT
/**
 * Scans the provided code characters for punctuator tokens, specifically focusing on variations of dots ('.', '..', '...', '.()').
 * 
 * @param {string[]} codeChars - An array of characters representing the code being scanned.
 * @param {number} idx - The current index within the code to start scanning from.
 * @returns {Token | undefined} - Returns a Token object representing the identified punctuator or undefined if no match is found.
 */
function scanPunctuatorForDots(codeChars: string[], idx: number): Token | undefined {
  const start = idx; // Store the starting index for the token

  // Extract the characters at specific positions relative to the current index
  const ch1 = codeChars[idx];
  const ch2 = codeChars[idx + 1];
  const ch3 = codeChars[idx + 2];

  // Check if the first character is a dot ('.')
  if (ch1 !== '.') {
    return undefined; // No match, return undefined
  }

  // Handle different punctuator variations involving dots:
  if (ch2 === '(' && ch3 === ')') {
    return {
      type: TokenType.PUNCT,
      value: '.()',
      range: [start, idx + 3], // Update range to include all three characters
    };
  }
  if (ch2 === '.' && ch3 === '.') {
    return {
      type: TokenType.PUNCT,
      value: '...',
      range: [start, idx + 3], // Update range to include all three characters
    };
  }
  if (ch2 === '.') {
    return {
      type: TokenType.PUNCT,
      value: '..',
      range: [start, idx + 2], // Update range to include both dots
    };
  }

  // Default case: single dot
  return {
    type: TokenType.PUNCT,
    value: '.',
    range: [start, idx + 1], // Update range to include single dot
  };
}
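
To see where such a function fits, here is a minimal sketch of a tokenizer loop that tries a dot scanner first and falls back to identifiers. It is plain JavaScript written for illustration: `tokenize`, `scanIdentifier`, the identifier rule, and the token type strings are assumptions, and `scanDots` is a condensed stand-in for `scanPunctuatorForDots`.

```javascript
// Hypothetical sketch of a lexer loop; not the engine's actual lexer.
const TokenType = { PUNCT: 'punct', ID: 'id' }; // illustrative values

// Condensed stand-in for scanPunctuatorForDots: the longest match wins.
function scanDots(chars, idx) {
  if (chars[idx] !== '.') return undefined;
  for (const value of ['.()', '...', '..', '.']) {
    if (chars.slice(idx, idx + value.length).join('') === value) {
      return { type: TokenType.PUNCT, value, range: [idx, idx + value.length] };
    }
  }
  return undefined;
}

// Scans an identifier such as "employees" (letters, digits, _ and $).
function scanIdentifier(chars, idx) {
  let end = idx;
  while (end < chars.length && /[A-Za-z0-9_$]/.test(chars[end])) end++;
  if (end === idx) return undefined;
  return { type: TokenType.ID, value: chars.slice(idx, end).join(''), range: [idx, end] };
}

function tokenize(code) {
  const chars = Array.from(code);
  const tokens = [];
  let idx = 0;
  while (idx < chars.length) {
    if (/\s/.test(chars[idx])) {
      idx++; // skip whitespace
      continue;
    }
    const token = scanDots(chars, idx) ?? scanIdentifier(chars, idx);
    if (!token) throw new Error(`Unexpected character '${chars[idx]}' at ${idx}`);
    tokens.push(token);
    idx = token.range[1]; // resume scanning right after the token
  }
  return tokens;
}

console.log(tokenize('.employees..name').map((t) => t.value));
// [ '.', 'employees', '..', 'name' ]
```

Checking the longer punctuators before the single dot is what keeps `..` from being read as two separate `.` tokens.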

4. Implement parsing (syntax analysis)

Parsing involves constructing a parse tree (or Abstract Syntax Tree — AST) from the tokenized source code. Implement a parser to generate the AST according to the grammar rules defined for the language.

With the descendant selector token identified in the previous step, we now combine it with the tokens around it to build an expression, a node in the Abstract Syntax Tree (AST), that represents the selector in a structured manner.

JAVASCRIPT
/**
 * Checks if the current token from the lexer is a dot (.) or double-dot (..) punctuator.
 * 
 * @returns {boolean} True if the current token is a dot or double-dot punctuator, false otherwise.
 */
function matchPathPartSelector(): boolean {
  const token = this.lexer.lookahead(); // Peek at the next token without consuming it
  if (token.type === TokenType.PUNCT) {
    return token.value === '.' || token.value === '..'; // Check if the value is '.' or '..'
  }
  return false;
}

/**
 * Parses a selector expression, which can be a simple identifier, wildcard (*), or a string literal.
 * 
 * @returns {SelectorExpression | IndexFilterExpression | Expression} The parsed selector expression object.
 */
function parseSelector(): SelectorExpression | IndexFilterExpression | Expression {
  const selector = this.lexer.value(); // Consume the current token as the selector
  // Other code

  let prop: Token | undefined;
  if (this.lexer.match('*') || this.lexer.matchID() || this.lexer.matchTokenType(TokenType.STR)) { // Check for specific token types
    prop = this.lexer.lex(); // Consume the next token as the property
  }
  return {
    type: SyntaxType.SELECTOR,
    selector,
    prop,
  };
}

/**
 * Parses a path part, which can be an expression, an array of expressions, or a selector expression based on the current token.
 * 
 * @returns {Expression | Expression[] | undefined} The parsed path part or undefined if not applicable.
 */
function parsePathPart(): Expression | Expression[] | undefined {
  // Other code
  } else if (matchPathPartSelector()) { // Check if the current token is a dot or double-dot
    return parseSelector(); // Parse a selector if it is
  } 
  // Other code
}

The above functions work together to identify and parse different parts of a path, with a focus on recognizing selectors within the path structure. They rely on a separate lexer module that provides functionality for reading and identifying different token types in the input stream.

This is the Abstract Syntax Tree (AST) representation for the expression `.employees..name`:

JAVASCRIPT
{
  "type": "statements_expr",
  "statements": [
    {
      "type": "path",
      "parts": [
        {
          "type": "selector",
          "selector": ".",
          "prop": {
            "type": "id",
            "value": "employees",
            "range": [
              1,
              10
            ]
          }
        },
        {
          "type": "selector",
          "selector": "..",
          "prop": {
            "type": "id",
            "value": "name",
            "range": [
              12,
              16
            ]
          }
        }
      ],
      "pathType": "rich"
    }
  ]
}
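
How a token stream becomes a tree like this can be sketched with a small loop. The following is a simplified illustration, not the engine's parser: `parsePath` is a hypothetical function, the tokens are passed in as a pre-built array, the `'punct'` type string is an assumption, and only `.` and `..` selectors followed by an optional identifier or `*` are handled.

```javascript
// Hypothetical sketch: builds a path AST from an array of tokens.
function parsePath(tokens) {
  const parts = [];
  let pos = 0;
  while (pos < tokens.length) {
    const selectorToken = tokens[pos];
    const isSelector =
      selectorToken.type === 'punct' &&
      (selectorToken.value === '.' || selectorToken.value === '..');
    if (!isSelector) {
      throw new Error(`Expected '.' or '..' at token ${pos}`);
    }
    pos++; // consume the selector

    // The property is optional: a selector may stand alone.
    let prop;
    const next = tokens[pos];
    if (next && (next.type === 'id' || next.value === '*')) {
      prop = next;
      pos++; // consume the property
    }
    parts.push({ type: 'selector', selector: selectorToken.value, prop });
  }
  return { type: 'path', parts };
}

const tokens = [
  { type: 'punct', value: '.', range: [0, 1] },
  { type: 'id', value: 'employees', range: [1, 10] },
  { type: 'punct', value: '..', range: [10, 12] },
  { type: 'id', value: 'name', range: [12, 16] },
];

const ast = parsePath(tokens);
console.log(JSON.stringify(ast, null, 2));
```

Each selector token starts a new part, and the token that follows it, if it qualifies, becomes that part's `prop`.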

5. Implement code translation

Translate the parsed AST into executable code in a target language (e.g., JavaScript). This involves traversing the AST and generating code that performs the specified JSON transformations as defined by the DSL.

The final step converts the descendant selector expression (AST) into executable JavaScript code that can be used in the desired context.

JAVASCRIPT
/**
 * Translates a selector expression with descendant operator (..) into executable code.
 * 
 * @param {SelectorExpression} expr - The selector expression containing the descendant operator.
 * @param {string} dest - The variable name to store the final result.
 * @param {string} baseCtx - The starting context for traversing descendant properties.
 * @returns {string} The generated JavaScript code representing the translation.
 */
function translateDescendantSelector(
  expr: SelectorExpression,
  dest: string,
  baseCtx: string,
): string {
  const code: string[] = []; // Array to store generated code lines

  // Acquire temporary variables for the translation process
  const ctxs = this.acquireVar();
  const currCtx = this.acquireVar();
  const result = this.acquireVar();

  // Initialize the result variable to an empty array
  code.push(JsonTemplateTranslator.generateAssignmentCode(result, '[]')); // Call a helper function to generate assignment code

  // Extract the property from the selector expression (if any)
  const { prop } = expr;
  const propStr = CommonUtils.escapeStr(prop?.value); // Escape the property value for safe string inclusion

  // Push initial code to set up the context list
  code.push(`${ctxs}=[${baseCtx}];`); // Assign the base context to the contexts list

  // Loop through contexts while there are more to process
  code.push(`while(${ctxs}.length > 0) {`);
  // Shift the current context from the list
  code.push(`${currCtx} = ${ctxs}.shift();`);

  // Handle empty contexts (skip if empty)
  code.push(`if(${JsonTemplateTranslator.returnIsEmpty(currCtx)}){continue;}`); // Call a helper function to check for emptiness

  // Handle context being an array (queue its elements for processing)
  code.push(`if(Array.isArray(${currCtx})){`);
  code.push(`${ctxs} = ${ctxs}.concat(${currCtx});`); // Concatenate the array elements to the contexts list
  code.push('continue;'); // Skip to the next iteration
  code.push('}');

  // Handle context being an object (process its properties)
  code.push(`if(typeof ${currCtx} === "object") {`);
  const valuesCode = JsonTemplateTranslator.returnObjectValues(currCtx); // Call a helper function to get object values
  code.push(`${ctxs} = ${ctxs}.concat(${valuesCode});`); // Concatenate object values to the contexts list
  if (prop) { // If there's a property in the selector
    if (prop?.value === '*') { // If the property is a wildcard (*)
      code.push(`${result} = ${result}.concat(${valuesCode});`); // Concatenate all object values to the result
    } else { // If the property is a specific key
      code.push(`if(Object.prototype.hasOwnProperty.call(${currCtx}, ${propStr})){`); // Check if the property exists on the object
      code.push(`${result} = ${result}.concat(${currCtx}[${propStr}]);`); // Append the property value to the result
      code.push('}');
    }
  }
  code.push('}');

  // If no property was specified, add the entire current context to the result
  if (!prop) {
    code.push(`${result}.push(${currCtx});`);
  }

  // Close the loop
  code.push('}');

  // Flatten the final result array (remove nested arrays)
  code.push(`${dest} = ${result}.flat();`);

  // Join all code lines and return the generated code
  return code.join('');
}

This code translates a selector expression containing the descendant operator (..) into executable JavaScript code. The generated code iterates through a queue of contexts, starting with the provided base context. If a context is an array, its elements are added to the queue for later processing. If it’s an object, its property values are added to the queue as well. The code also considers the property specified in the selector: if it’s a wildcard (*), all object values are included in the result; otherwise, only the value for the specific property key is included. Finally, the code flattens the result array by one level and stores it in the destination variable.
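
The shape of the emitted code is easier to see in a reduced form. The sketch below is a miniature built on several assumptions: `emitDescendantLookup` is a hypothetical function, it supports only a named property (no wildcard and no missing property), it uses fixed variable names instead of acquiring temporaries, and it relies on `new Function` to turn the emitted source into something callable.

```javascript
// Hypothetical miniature of the descendant translation; not the engine's code.
// Emits JavaScript source that collects every own property `key`
// found on the input value or anywhere below it.
function emitDescendantLookup(key) {
  const k = JSON.stringify(key); // quote and escape the key for safe embedding
  const code = [];
  code.push('let result = [];');
  code.push('let queue = [input];');
  code.push('while (queue.length > 0) {');
  code.push('  const current = queue.shift();');
  code.push('  if (current === null || current === undefined) { continue; }');
  code.push('  if (Array.isArray(current)) { queue = queue.concat(current); continue; }');
  code.push('  if (typeof current === "object") {');
  code.push('    queue = queue.concat(Object.values(current));');
  code.push(`    if (Object.prototype.hasOwnProperty.call(current, ${k})) {`);
  code.push(`      result = result.concat(current[${k}]);`);
  code.push('    }');
  code.push('  }');
  code.push('}');
  code.push('return result.flat();');
  return code.join('\n');
}

const source = emitDescendantLookup('name');
const findNames = new Function('input', source);

const employee = {
  name: 'Ada',
  manager: { name: 'Grace', reports: [{ name: 'Linus' }] },
};
console.log(findNames(employee)); // [ 'Ada', 'Grace', 'Linus' ]
```

Because the key is embedded in the source at generation time, the resulting function contains no template parsing or AST walking at all.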

Below is the code generated for the expression `.employees..name`. It has been modified from the original generated code for better readability.

JAVASCRIPT
// This function takes an input object and processes its 'employees' property to extract names into an array.
function extractEmployeeNames(inputObject) {
  let result; // Initialize variable to store final result
  let currentObject; // Temporary variable for iterating over input object
  let employeesArray; // Temporary variable for storing 'employees' array
  let i; // Counter variable for looping over input object
  let j; // Counter variable for looping over 'employees' array
  let currentEmployee; // Temporary variable for each 'employee' object
  let extractedNames; // Temporary variable for storing names extracted from 'employee' object
  let collectedNames; // Array to collect extracted names
  let queue; // Queue for BFS traversal of objects
  let currentQueueItem; // Temporary variable for BFS traversal
  let tempNames; // Temporary array to collect names during traversal

  // Initialize result with the input object
  result = inputObject;

  // Initialize collectedNames as an empty array to collect extracted names
  collectedNames = [];

  // Assign currentObject to input object
  currentObject = result;

  // Normalize currentObject to an array; null or undefined becomes an empty array so the loop below is skipped
  if (currentObject === null || currentObject === undefined) {
    currentObject = [];
  } else {
    currentObject = Array.isArray(currentObject) ? currentObject : [currentObject];
  }

  // Loop through each item in currentObject
  for (i = 0; i < currentObject.length; i++) {
    // Assign employeesArray to 'employees' property of current item
    employeesArray = currentObject[i]?.employees;

    // Continue if employeesArray is null or undefined
    if (employeesArray === null || employeesArray === undefined) {
      continue;
    }

    // Loop through each item in employeesArray
    for (j = 0; j < employeesArray.length; j++) {
      // Assign currentEmployee to current item in employeesArray
      currentEmployee = employeesArray[j];

      // Initialize tempNames as an empty array to collect names
      tempNames = [];

      // Initialize queue with currentEmployee in an array
      queue = [currentEmployee];

      // Perform BFS traversal on queue until it's empty
      while (queue.length > 0) {
        // Remove the first item from the queue
        currentQueueItem = queue.shift();

        // Continue if currentQueueItem is null or undefined
        if (currentQueueItem === null || currentQueueItem === undefined) {
          continue;
        }

        // If currentQueueItem is an array, concatenate it with queue
        if (Array.isArray(currentQueueItem)) {
          queue = queue.concat(currentQueueItem);
          continue;
        }

        // If currentQueueItem is an object, extract values and filter out null or undefined ones
        if (typeof currentQueueItem === "object") {
          queue = queue.concat(Object.values(currentQueueItem).filter(v => v !== null && v !== undefined));
          // If 'name' property exists in currentQueueItem, add it to tempNames
          if (currentQueueItem.hasOwnProperty('name')) {
            tempNames = tempNames.concat(currentQueueItem.name);
          }
        }
      }

      // Flatten tempNames and assign it to extractedNames
      extractedNames = tempNames.flat();

      // Continue if extractedNames is null or undefined
      if (extractedNames === null || extractedNames === undefined) {
        continue;
      }

      // Push extractedNames into collectedNames
      collectedNames.push(extractedNames);
    }
  }

  // If collectedNames has fewer than two elements, unwrap it (this yields undefined when it is empty)
  collectedNames = collectedNames.length < 2 ? collectedNames[0] : collectedNames;

  // Assign result to collectedNames
  result = collectedNames;

  // Return the final result
  return result;
}
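
To make the behavior concrete, the following sketch condenses the function above into a few lines and runs it on sample data. The helper names (`collectDescendants`, `employeeNames`) and the sample input are made up for illustration; the traversal and the final unwrapping step follow the generated code.

```javascript
// Condensed, hypothetical equivalent of the generated function above.
function collectDescendants(root, key) {
  let found = [];
  let queue = [root];
  while (queue.length > 0) {
    const item = queue.shift();
    if (item === null || item === undefined) continue;
    if (Array.isArray(item)) {
      queue = queue.concat(item);
      continue;
    }
    if (typeof item === 'object') {
      queue = queue.concat(Object.values(item));
      if (Object.prototype.hasOwnProperty.call(item, key)) {
        found = found.concat(item[key]);
      }
    }
  }
  return found.flat();
}

function employeeNames(input) {
  // Wrap a single object in an array; treat null or undefined as empty.
  const roots = input === null || input === undefined ? [] : [].concat(input);
  const collected = [];
  for (let i = 0; i < roots.length; i++) {
    const employees = roots[i]?.employees;
    if (employees === null || employees === undefined) continue;
    for (let j = 0; j < employees.length; j++) {
      // One group of names per employee.
      collected.push(collectDescendants(employees[j], 'name'));
    }
  }
  // Fewer than two groups: unwrap, as the generated code does.
  return collected.length < 2 ? collected[0] : collected;
}

const company = {
  employees: [
    { name: 'Ada', manager: { name: 'Grace' } },
    { name: 'Linus' },
  ],
};
console.log(employeeNames(company)); // [ [ 'Ada', 'Grace' ], [ 'Linus' ] ]
console.log(employeeNames({ employees: [{ name: 'Ada' }] })); // [ 'Ada' ]
```

Note that names are grouped per employee, and that a single group is returned unwrapped rather than as an array of one array.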

Conclusion

Building a DSL at RudderStack empowered our engineering team to simplify complex workflows and scale our efforts in building and managing hundreds of integrations for our Customer Data Platform. This guide covered the process we used to craft a domain-specific language (DSL) for JSON transformation and build a tailored solution to streamline data integration challenges.

We covered everything from understanding the need for a DSL to implementing lexing, parsing, and code translation. Following this guide, you can create your own custom DSLs to address specific domain requirements.

March 28, 2024