Members
cache :Object
This code builds a decoder object for converting encoded tokens back to their original string form by inverting the keys and values of the encoder object.
It also sets up byte encoder and decoder objects for converting bytes to Unicode characters and back.
Finally, it initializes a cache that stores previously processed tokens so repeated inputs can be encoded faster.
Type:
- Object
Methods
bpe(token) → {string}
Implements the Byte Pair Encoding (BPE) algorithm for subword tokenization.
The BPE algorithm operates on a vocabulary of subwords and works by iteratively replacing the most frequent pair of
subwords in the vocabulary with a new subword until a specified vocabulary size is reached. This results in a vocabulary
of subwords that can represent the words of a language while still maintaining some of the structure and
meaning of the original words.
Here's a breakdown of the function (a minimal sketch of the merge loop follows the return details below):
1. The function first checks whether the input token is in the cache; if it is, the cached value is returned. This avoids reprocessing tokens that have already been encoded.
2. The token is then split into individual characters, and a list of pairs of adjacent characters (bigrams) is generated with the get_pairs function. If there are no pairs, the token is returned as is.
3. The function then enters a loop that runs until a termination condition is met. In each iteration, the pair of subwords with the lowest rank (as determined by the bpe_ranks object) is identified and stored in the bigram variable. If the bigram is not in bpe_ranks, the loop terminates.
4. Every adjacent occurrence of the bigram in the word list is then merged into a single new subword.
5. Finally, the word list is joined back into a string, stored in the cache, and returned as the result of the function.
Parameters:
Name | Type | Description |
---|---|---|
token | string | The input token to be tokenized. |
Returns:
word - The tokenized subwords as a string.
- Type: string
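For illustration, here is a minimal, self-contained sketch of the merge loop described above. It uses a tiny hand-made bpe_ranks table and a local stand-in for get_pairs; it is not the module's exact code (the real function also handles the byte-level alphabet and the cache), but it shows the same lowest-rank-first merging.

```js
// Minimal BPE merge-loop sketch (illustration only, not the module's exact code).
// bpe_ranks: a tiny hand-made merge table; a lower rank means the pair merges earlier.
const bpe_ranks = new Map([
  ['l,o', 0],
  ['lo,w', 1],
  ['e,r', 2],
]);

// Local stand-in for the module's get_pairs helper.
// For simplicity it returns 'a,b' strings; the module's version returns a Set of arrays.
function getPairs(word) {
  const pairs = new Set();
  for (let i = 0; i < word.length - 1; i++) {
    pairs.add(`${word[i]},${word[i + 1]}`);
  }
  return pairs;
}

function bpe(token) {
  let word = token.split('');        // start from individual characters
  let pairs = getPairs(word);
  while (pairs.size > 0) {
    // Pick the pair with the lowest rank in bpe_ranks.
    let bigram = null;
    let best = Infinity;
    for (const pair of pairs) {
      const rank = bpe_ranks.has(pair) ? bpe_ranks.get(pair) : Infinity;
      if (rank < best) {
        best = rank;
        bigram = pair;
      }
    }
    if (bigram === null || best === Infinity) break;   // no mergeable pair left
    const [first, second] = bigram.split(',');

    // Merge every adjacent occurrence of (first, second) into one subword.
    const merged = [];
    let i = 0;
    while (i < word.length) {
      if (i < word.length - 1 && word[i] === first && word[i + 1] === second) {
        merged.push(first + second);
        i += 2;
      } else {
        merged.push(word[i]);
        i += 1;
      }
    }
    word = merged;
    if (word.length === 1) break;
    pairs = getPairs(word);
  }
  return word.join(' ');             // subwords joined back into a string
}

console.log(bpe('lower'));           // "low er"
```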
bytes_to_unicode() → {Object.<number, string>}
Returns a mapping of byte values to their corresponding Unicode characters.
Returns:
- A mapping of byte values to Unicode characters.
- Type: Object.<number, string>
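A sketch of how such a mapping is typically built in GPT-2-style byte-level BPE (an assumption here; the module's exact ranges may differ): byte values that are already printable keep their own code point, and the remaining bytes are shifted to unused code points above 255 so that every byte maps to a distinct printable character.

```js
// Sketch of a byte -> Unicode mapping in the GPT-2 style (illustration only).
function bytes_to_unicode() {
  // Byte values that already map to printable characters keep their own code point.
  const bs = [];
  for (let b = '!'.charCodeAt(0); b <= '~'.charCodeAt(0); b++) bs.push(b);
  for (let b = '¡'.charCodeAt(0); b <= '¬'.charCodeAt(0); b++) bs.push(b);
  for (let b = '®'.charCodeAt(0); b <= 'ÿ'.charCodeAt(0); b++) bs.push(b);

  const cs = bs.slice();
  let n = 0;
  // Every other byte value is assigned a fresh code point above 255.
  for (let b = 0; b < 256; b++) {
    if (!bs.includes(b)) {
      bs.push(b);
      cs.push(256 + n);
      n += 1;
    }
  }

  // Build { byteValue: unicodeChar }.
  const mapping = {};
  bs.forEach((b, i) => { mapping[b] = String.fromCharCode(cs[i]); });
  return mapping;
}

const byte_encoder = bytes_to_unicode();
console.log(byte_encoder[32]);   // the space byte maps to a printable stand-in, 'Ġ' (U+0120)
```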
chr(x) → {string}
Returns the character corresponding to a Unicode code point; the inverse of ord.
Parameters:
Name | Type | Description |
---|---|---|
x | number | The Unicode code point to get the corresponding character for. |
Returns:
- The character corresponding to the given Unicode code point.
- Type: string
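A plausible one-line implementation, shown as a sketch (the module's code may differ):

```js
// Sketch: Unicode code point -> character; the inverse of ord.
const chr = (x) => String.fromCodePoint(x);

console.log(chr(97));    // "a"
console.log(chr(288));   // "Ġ"
```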
countTokens(text) → {number}
This function iterates over the matches of the pat pattern in the input text,
encodes each match using the encodeStr function and the byte_encoder mapping,
and applies the bpe function to the encoded token. The number of tokens produced by bpe is added to a running count,
which is returned as the result.
Parameters:
Name | Type | Description |
---|---|---|
text | string | The input text to count tokens in. |
Returns:
- The number of tokens in the input text.
- Type: number
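A usage sketch; the import path below is a placeholder, so adjust it to wherever this module lives in your project:

```js
// Usage sketch — the import path is a placeholder, not a published package name.
import { countTokens } from './Encoder.js';

const text = 'The quick brown fox jumps over the lazy dog.';
console.log(countTokens(text));   // number of BPE tokens (exact count depends on the vocabulary)
```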
decode(tokens) → {string}
Decodes a list of BPE tokens into a text string.
Parameters:
Name | Type | Description |
---|---|---|
tokens | Array | The list of BPE tokens to be decoded. |
Returns:
text - The decoded text string.
- Type: string
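A round-trip usage sketch (placeholder import path): decoding the output of encode is expected to reproduce the original text.

```js
// Usage sketch — the import path is a placeholder.
import { encode, decode } from './Encoder.js';

const original = 'hello world';
const tokens = encode(original);
console.log(decode(tokens) === original);   // expected: true — decode reverses encode
```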
encode(text) → {Array}
Encodes a given text string into a list of BPE tokens.
Parameters:
Name | Type | Description |
---|---|---|
text | string | The text to be encoded. |
Returns:
bpe_tokens - The encoded BPE tokens.
- Type: Array
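A usage sketch (placeholder import path). The length of the returned array should agree with countTokens, since both walk the same match/encode/BPE pipeline described above.

```js
// Usage sketch — the import path is a placeholder.
import { encode, countTokens } from './Encoder.js';

const text = 'Byte Pair Encoding splits rare words into subwords.';
const bpe_tokens = encode(text);
console.log(bpe_tokens);                                // array of numeric token ids
console.log(bpe_tokens.length === countTokens(text));   // expected: true
```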
encodeStr(str) → {Array.<string>}
Encodes a given string as an array of string representations of its UTF-8 encoded bytes.
Parameters:
Name | Type | Description |
---|---|---|
str | string | The string to encode. |
Returns:
- An array of string representations of the UTF-8 encoded bytes of the input string.
- Type: Array.<string>
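A plausible implementation sketch using the standard TextEncoder API (the module's actual code may differ):

```js
// Sketch: string -> array of stringified UTF-8 byte values.
function encodeStr(str) {
  return Array.from(new TextEncoder().encode(str), (b) => b.toString());
}

console.log(encodeStr('hé'));   // [ "104", "195", "169" ] — "é" is two UTF-8 bytes
```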
get_pairs(word) → {Set.<Array.<string>>}
Returns a set of all the pairs of adjacent characters in a given string.
Parameters:
Name | Type | Description |
---|---|---|
word | string | The string to get pairs of adjacent characters from. |
Returns:
- A set of all the pairs of adjacent characters in the string.
- Type: Set.<Array.<string>>
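A sketch of the idea (illustration only):

```js
// Sketch: collect the pairs of adjacent symbols (bigrams) in a word.
// Note: arrays are compared by reference, so repeated bigrams each get their own entry.
function get_pairs(word) {
  const pairs = new Set();
  for (let i = 0; i < word.length - 1; i++) {
    pairs.add([word[i], word[i + 1]]);
  }
  return pairs;
}

console.log([...get_pairs('hello')]);
// [ [ 'h', 'e' ], [ 'e', 'l' ], [ 'l', 'l' ], [ 'l', 'o' ] ]
```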
ord(x) → {number}
Returns the Unicode code point of the first character in a string.
In computer science, the term "ord" is short for "ordinal" or "order"; the function is the inverse of chr.
Parameters:
Name | Type | Description |
---|---|---|
x | string | The string to get the code point of. |
Returns:
- The Unicode code point of the first character in the string.
- Type: number
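A plausible one-line implementation, as a sketch:

```js
// Sketch: first character of a string -> Unicode code point; the inverse of chr.
const ord = (x) => x.codePointAt(0);

console.log(ord('a'));   // 97
console.log(ord('Ġ'));   // 288
```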
range(x, y) → {Array.<number>}
Returns an array of numbers between x and y (inclusive).
Parameters:
Name | Type | Description |
---|---|---|
x | number | The starting number. |
y | number | The ending number. |
Returns:
- An array of numbers between x and y (inclusive).
- Type: Array.<number>
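A sketch matching the documented contract (inclusive of both endpoints):

```js
// Sketch: numbers from x to y, inclusive.
function range(x, y) {
  return Array.from({ length: y - x + 1 }, (_, i) => x + i);
}

console.log(range(3, 7));   // [ 3, 4, 5, 6, 7 ]
```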
tokenStats(input) → {Object}
Computes count, unique, and frequency statistics for a string or an array of tokens.
This function can be used to get insights into the characteristics of a text dataset,
or to analyze the distribution of tokens in a body of text.
Parameters:
Name | Type | Description |
---|---|---|
input | string \| Array.<number> | The input string or array of tokens. |
Properties:
Name | Type | Description |
---|---|---|
stats.count | number | The total number of tokens. |
stats.unique | number | The number of unique tokens. |
stats.frequency | Object | An object with token-frequency pairs, sorted by frequency in descending order. |
stats.positions | Object | An object with token-position pairs, where positions is an array of the indices of the token in the input string or array. |
stats.tokens | Array.<number> | The array of tokens passed to the function. |
Returns:
stats - An object with count, unique, frequency, positions, and tokens properties.
- Type: Object
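A usage sketch (placeholder import path):

```js
// Usage sketch — the import path is a placeholder.
import { tokenStats } from './Encoder.js';

const stats = tokenStats('the cat sat on the mat');
console.log(stats.count);      // total number of tokens
console.log(stats.unique);     // number of distinct tokens
console.log(stats.frequency);  // token -> occurrence count, most frequent first
console.log(stats.positions);  // token -> array of indices where it occurs
console.log(stats.tokens);     // the underlying token array
```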