Deconstructing the Deep TSK Fuzzy Classifier with Stacked Generalization

A detailed technical analysis of the Deep Takagi-Sugeno-Kang Fuzzy Classifier (D-TSK-FC), a novel machine learning model designed to achieve both high performance and clear interpretability. This article breaks down the model's unique stacked architecture, its layer-by-layer training algorithm based on the Least Learning Machine (LLM), and the core principles that guarantee its "triplely concise interpretability."

In the landscape of modern machine learning, a persistent challenge lies at the intersection of performance and transparency. Deep learning architectures have achieved state-of-the-art results across numerous domains, yet their “black box” nature often makes them unsuitable for high-stakes applications where interpretability is paramount. Conversely, traditional rule-based models, such as fuzzy systems, offer clear, human-readable logic but may not reach the same level of predictive accuracy.

The Deep Takagi-Sugeno-Kang Fuzzy Classifier (D-TSK-FC) represents an ambitious attempt to resolve this dichotomy. It is engineered from the ground up to be a “white box” deep model—one that provides both enhanced classification performance and a guarantee of clear interpretability. The model achieves this by creating a novel synergy between the logical structure of TSK fuzzy systems, the error-correcting power of stacked generalization, and the computational efficiency of the Least Learning Machine (LLM).

Base Unit

The most fundamental component of the model is a specialized zero-order TSK fuzzy classifier. This choice is deliberate, prioritizing interpretability at the most granular level.

First Interpretability Guarantee

To ensure that the premises (the “IF” parts) of the rules are linguistically clear, the model adopts a fixed fuzzy partition strategy.

  • It employs five Gaussian membership functions (GMFs) for every input feature.
  • The centers of these GMFs are permanently fixed at [0, 0.25, 0.5, 0.75, 1].
  • They are assigned the explicit linguistic labels: “very low, low, medium, high, very high”.

This design ensures that the definition of a concept like “low” is consistent across all rules (always centered at 0.25), which avoids the semantic ambiguity that can arise from learned or irregularly shaped membership functions.
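To make this concrete, the following minimal sketch (in Python/NumPy) evaluates the five fixed GMFs for a single normalized feature value. The common width `SIGMA` is an illustrative assumption, since the post fixes only the centers and linguistic labels.

```python
import numpy as np

# Fixed fuzzy partition: five Gaussian membership functions per feature,
# centered at 0, 0.25, 0.5, 0.75, 1 and carrying the linguistic labels below.
CENTERS = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
LABELS = ["very low", "low", "medium", "high", "very high"]
SIGMA = 0.25  # assumed common width; only the centers and labels are fixed by the model

def membership_grades(x_j):
    """Membership of a single normalized feature value in each of the 5 GMFs."""
    return np.exp(-((x_j - CENTERS) ** 2) / (2 * SIGMA ** 2))

# Example: a feature value of 0.3 is mostly "low", partly "medium"
print(dict(zip(LABELS, np.round(membership_grades(0.3), 3))))
```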

Second Interpretability Guarantee

To reduce rule complexity, the model constructs rules that do not necessarily use all input features. This simplification is achieved through two randomly generated matrices.

  • Feature Selector Matrix ($\Upsilon$): This is a $d \times K$ matrix (where $d$ is the number of features and $K$ is the number of rules). Each element, $\gamma_{jk}$, acts as a binary “switch,” determining if feature j is used in rule k. If $\gamma_{jk}=0$, the feature is ignored in that specific rule.
  • Rule Combination Matrix ($RC$): This is a $d \times 5 \times K$ binary matrix. For the features selected by the $\Upsilon$ matrix, the $RC$ matrix indicates which of the five fixed GMFs (e.g., “high” or “low”) is adopted for that feature in that rule.

Through the combined action of these two matrices, a simple rule that focuses on only a subset of features is automatically generated, as in the following example.

IF $x_1$ is low (with $\gamma_{1k}=1$) AND $x_2$ is high (with $\gamma_{2k}=1$) AND $x_3$ is not involved (with $\gamma_{3k}=0$) $\cdots$ THEN the class output is $p_0^k$

This design enhances readability and mitigates the risk of the “curse of dimensionality”. Furthermore, the structure of this unit is analogous to a single-layer feedforward network, which allows it to be trained efficiently using the Least Learning Machine (LLM) algorithm.
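The sketch below illustrates how such a random rule structure could be generated. The function name `make_rule_structure` and the selection probabilities are illustrative assumptions; only the shapes of $\Upsilon$ ($d \times K$) and $RC$ ($d \times 5 \times K$) come from the description above.

```python
import numpy as np

def make_rule_structure(d, K, p_feature=0.5, p_gmf=0.3, rng=None):
    """Randomly generate the rule premises for one base unit (a sketch).

    Upsilon : (d, K) binary feature-selector matrix; Upsilon[j, k] == 1
              means feature j participates in rule k.
    RC      : (d, 5, K) binary matrix indicating which of the 5 fixed GMFs
              are adopted for each selected feature in each rule.
    The probabilities p_feature / p_gmf are illustrative choices, not
    values prescribed by the paper.
    """
    rng = np.random.default_rng(rng)
    Upsilon = (rng.random((d, K)) < p_feature).astype(int)
    RC = (rng.random((d, 5, K)) < p_gmf).astype(int)
    # Ensure every selected feature adopts at least one GMF,
    # so that no rule premise is vacuously empty.
    for j in range(d):
        for k in range(K):
            if Upsilon[j, k] == 1 and RC[j, :, k].sum() == 0:
                RC[j, rng.integers(5), k] = 1
    return Upsilon, RC
```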

The Stacked Structure

While a single base unit is interpretable, achieving high classification accuracy requires stacking such units into a deep structure. The model’s deep architecture is built upon the stacked generalization principle, with the goal of correcting prediction errors layer by layer to enhance overall performance.

The data flow mechanism is the most critical innovation of this model and the source of the third interpretability guarantee.

  • Layer 1: The model’s input is the original training dataset, $X_1 = X$. The first base-building unit trains on this data and produces a prediction result, $Y_1$.

  • The Core Mechanism: The information passed to the next layer is not the output $Y_1$ itself. Instead, it is the original training set plus a random shift derived from the previous layer’s prediction results.

  • The input for the next layer, $X_2$, is calculated as follows:

\[X_2 = X + \alpha Y_1 Z_1\]

Here, $X$ is the original training set, $Y_1$ is the prediction output from the first layer, $\alpha$ is a small given constant, and $Z_1$ is a random projection matrix.

Third Interpretability Guarantee

  • This unique data flow ensures that every base unit operates on the same input space as the original data.
  • This is a key distinction from conventional stacked or hierarchical fuzzy systems, where subsequent layers operate on the abstract, non-physical outputs of previous layers, making their rules difficult to interpret.
  • In the D-TSK-FC, because each layer’s input is just a slightly perturbed version of the original data $X$, any feature referenced in a rule—regardless of the layer—is always the original feature with its clear physical meaning.

Example

Initial Setup

  • Training Data ($X$): We assume a training set with $N=3$ samples and $d=2$ features. The data has been normalized to a $[0, 1]$ range.
\[X = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 0.2 & 0.8 \\ 0.9 & 0.1 \\ 0.3 & 0.6 \end{pmatrix}\]
  • Class Labels ($T$): This is a binary classification problem ($m=2$), with labels represented using one-hot encoding.

    • Sample 1 belongs to Class 2
    • Sample 2 belongs to Class 1
    • Sample 3 belongs to Class 2
\[T = \begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}\]
  • Hyperparameters:

    • Perturbation Coefficient: $\alpha = 0.03$.
    • Model Depth: $DP = 2$.

The Calculation

Layer 1
  • Input: The input for the first base-building unit is the original training data, $X_1 = X$.

  • Training and Prediction:

    • The model is trained on the data pair $(X_1, T)$. This process involves randomly generating the rule structure, calculating the hidden layer matrix $H_1$, and using the LLM algorithm to analytically compute the output weights $\beta_1$.
    • After training, the model predicts on the input $X_1$ to get the prediction output matrix $Y_1 = H_1\beta_1$.
    • Assuming the resulting prediction matrix $Y_1$ (a $3 \times 2$ matrix) is as follows:
\[Y_1 = \begin{pmatrix} 0.3 & 0.7 \\ 0.6 & 0.4 \\ 0.4 & 0.6 \end{pmatrix}\]

Interpretation: The model predicts that sample 1 belongs to class 2 (output 0.7), sample 2 to class 1 (output 0.6), and sample 3 to class 2 (output 0.6). The predictions point toward the correct classes, but with low confidence.

Input for Layer 2

This is the core step of the stacked architecture. We will now generate the input for the second layer, $X_2$, based on the output of the first layer, $Y_1$.

  • The Core Formula:

    \[X_2 = X + \alpha Y_1 Z_1\]
  • Generate Matrix ($Z_1$):

    • We need a random matrix $Z_1$. For the matrix multiplication to be valid, its dimensions must be $m \times d$ (i.e., $2 \times 2$).
    • Let’s assume the randomly generated $Z_1$ is as follows (with elements between 0 and 1):
\[Z_1 = \begin{pmatrix} 0.8 & 0.2 \\ 0.3 & 0.9 \end{pmatrix}\]
  • Calculate $\alpha Y_1 Z_1$:
\[Y_1 Z_1 = \begin{pmatrix} 0.3 & 0.7 \\ 0.6 & 0.4 \\ 0.4 & 0.6 \end{pmatrix} \begin{pmatrix} 0.8 & 0.2 \\ 0.3 & 0.9 \end{pmatrix} = \begin{pmatrix} 0.45 & 0.69 \\ 0.60 & 0.48 \\ 0.50 & 0.62 \end{pmatrix}\]
\[\text{Perturbation} = \alpha Y_1 Z_1 = 0.03 \times \begin{pmatrix} 0.45 & 0.69 \\ 0.60 & 0.48 \\ 0.50 & 0.62 \end{pmatrix} = \begin{pmatrix} 0.0135 & 0.0207 \\ 0.0180 & 0.0144 \\ 0.0150 & 0.0186 \end{pmatrix}\]
  • Final Input for Layer 2 ($X_2$):
\[X_2 = X + \text{Perturbation} = \begin{pmatrix} 0.2135 & 0.8207 \\ 0.9180 & 0.1144 \\ 0.3150 & 0.6186 \end{pmatrix}\]
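The same computation can be checked with a few lines of NumPy, reproducing the numbers above:

```python
import numpy as np

# Reproduce the worked example: the layer-2 input from the layer-1 predictions.
X  = np.array([[0.2, 0.8], [0.9, 0.1], [0.3, 0.6]])
Y1 = np.array([[0.3, 0.7], [0.6, 0.4], [0.4, 0.6]])
Z1 = np.array([[0.8, 0.2], [0.3, 0.9]])
alpha = 0.03

X2 = X + alpha * (Y1 @ Z1)
print(np.round(X2, 4))
# [[0.2135 0.8207]
#  [0.918  0.1144]
#  [0.315  0.6186]]
```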
Layer 2
  • Input: The second base-building unit now receives the newly computed, slightly perturbed dataset $X_2$ as its input.

  • Training and Prediction:

    • The model now trains on the new data pair $(X_2, T)$. Note that the target labels $T$ do not change.
    • The task for this second layer is to learn a mapping from the perturbed data $X_2$ to the original correct labels $T$. The samples in $X_2$ may have been “pushed” into positions that are easier to classify correctly, as the perturbation contained information from the first layer’s predictions.
    • After training, it will produce an output $Y_2$. Since we set the depth $DP=2$, this $Y_2$ is the final prediction result of the entire D-TSK-FC model.

Deep Learning Algorithm

The algorithm’s core principle is layer-by-layer construction, where it iteratively builds and trains each base-building unit until the desired model depth is achieved.

The algorithm takes the training data ($X$), labels ($T$), and the model depth ($DP$) as input. The process unfolds within a main loop that executes $DP$ times, once for each layer.

For any given layer dp within the loop, the key steps are as follows:

  • Rule Premise Initialization: The “IF” part of all fuzzy rules for the current layer is defined. This is done by randomly generating the Feature Selector Matrix ($\Upsilon_{dp}$) and the Rule Combination Matrix ($RC_{dp}$), which injects the necessary structural randomness for the layer.

  • Hidden Layer Calculation: The objective of this phase is to compute the firing strength of every rule for every training sample, forming the hidden layer output matrix, $H_{dp}$.

    • The process begins by calculating the basic membership grades of each feature for the five fixed Gaussian functions.
    • Next, it computes the per-feature activation value, $v_{jl}$.
\[v_{jl}(x_{ij}) = \begin{cases} 1 - \prod_{k=1}^{5} \left(1 - RC_{dp}(j, k, l)\, u_k(x_{ij})\right) & \text{if } \gamma_{jl} = 1 \\ 1 & \text{if } \gamma_{jl} = 0 \end{cases}\]
where $u_k(x_{ij})$ denotes the membership grade of feature value $x_{ij}$ in the $k$-th fixed GMF.

If a feature is ignored ($\gamma_{jl} = 0$), its contribution to the final product is a neutral 1. If the feature is used ($\gamma_{jl} = 1$), its activation is calculated. The $1 - \prod(1 - …)$ structure is a fuzzy logic operator known as the “algebraic sum,” which acts like an “OR” gate. It allows the rule to be activated if the feature belongs to any of the GMFs selected by the $RC$ matrix for that rule.

Finally, the overall firing strength for a rule, $w_{il}$, is the product of all its per-feature activations ($v_{jl}$). These values for all samples and rules form the matrix $H_{dp}$.
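A minimal sketch of this hidden-layer computation is given below. The helper name `compute_hidden_layer` and the Gaussian width `SIGMA` are assumptions; the fixed centers, the $\gamma$/RC switching logic, and the algebraic-sum/product combination follow the description above.

```python
import numpy as np

CENTERS = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # fixed GMF centers
SIGMA = 0.25                                      # assumed common GMF width

def compute_hidden_layer(X, Upsilon, RC):
    """Firing-strength (hidden layer) matrix H of shape (N, K) for one layer."""
    N, d = X.shape
    K = Upsilon.shape[1]
    H = np.ones((N, K))
    for i in range(N):                  # sample index
        for l in range(K):              # rule index
            w = 1.0
            for j in range(d):          # feature index
                if Upsilon[j, l] == 0:
                    v = 1.0             # ignored feature contributes a neutral 1
                else:
                    # membership grades of x_ij in the five fixed GMFs
                    u = np.exp(-((X[i, j] - CENTERS) ** 2) / (2 * SIGMA ** 2))
                    # algebraic sum ("OR") over the GMFs adopted by RC for this rule
                    v = 1.0 - np.prod(1.0 - RC[j, :, l] * u)
                w *= v                  # "AND" (product) across the selected features
            H[i, l] = w
    return H
```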

  • LLM-based Learning: This is the critical learning step for the layer. The consequent parameters ($\beta_{dp}$) are solved for analytically using the scalable version of the Least Learning Machine formula:

    \[\beta_{dp} = \left(\frac{1}{C}I + H_{dp}^T H_{dp}\right)^{-1} H_{dp}^T T\]

    This non-iterative step, which involves inverting a smaller $K_{dp} \times K_{dp}$ matrix, is key to the algorithm’s efficiency on large datasets.

  • The Stacking: Once the layer is trained, the algorithm prepares for the next. The layer’s predictions, $Y_{dp} = H_{dp}\beta_{dp}$, are calculated. This output is then used to generate the perturbed input for the next layer, $X_{dp+1}$, using the core stacking formula: $X_{dp+1} = X + \alpha Y_{dp} Z_{dp}$.

After the loop completes, the final prediction function of the entire model is the output of the last layer ($Y_{DP} = H_{DP}\beta_{DP}$), and all the interpretable fuzzy rules from each layer can be formally described.
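Putting the pieces together, a hedged sketch of the layer-by-layer training loop might look as follows. It reuses the `make_rule_structure` and `compute_hidden_layer` helpers sketched earlier, and the hyperparameter defaults ($K$, $C$, $\alpha$, $DP$) are illustrative rather than values prescribed by the paper.

```python
import numpy as np

def train_d_tsk_fc(X, T, DP=3, K=30, alpha=0.03, C=100.0, seed=0):
    """Layer-by-layer construction of a D-TSK-FC (sketch).

    Reuses make_rule_structure() and compute_hidden_layer() from the
    sketches above; all defaults are illustrative only.
    """
    rng = np.random.default_rng(seed)
    d, m = X.shape[1], T.shape[1]
    layers = []
    X_dp = X                                   # layer 1 sees the original data
    for dp in range(DP):
        # 1) Randomly fix the rule premises (IF-parts) of this layer
        Upsilon, RC = make_rule_structure(d, K, rng=rng)
        # 2) Hidden layer: firing strength of every rule for every sample
        H = compute_hidden_layer(X_dp, Upsilon, RC)
        # 3) Analytic LLM solution of the consequents:
        #    beta = (I/C + H^T H)^{-1} H^T T
        beta = np.linalg.solve(np.eye(K) / C + H.T @ H, H.T @ T)
        # 4) Stacking: perturb the ORIGINAL data with this layer's predictions
        Y = H @ beta
        Z = rng.random((m, d))                 # random projection matrix Z_dp
        layers.append((Upsilon, RC, beta, Z))
        X_dp = X + alpha * Y @ Z               # input of the next layer
    return layers
```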

Time Complexity and Prediction

  • Time Complexity: Due to the scalable LLM implementation, the algorithm’s time complexity is approximately linear with respect to the number of training samples, $N$ (when $N \gg K$). This makes the D-TSK-FC a viable choice for large-scale data problems.

  • The Prediction Process: For a new sample $z$, the process mirrors the training flow: $z$ is passed sequentially through the layers. At each layer $dp$, the model calculates a prediction $y_{dp}$, which is then used to update the sample’s representation for the next layer using the same stacking formula: $z_{dp+1} = z + \alpha y_{dp} Z_{dp}$. The final output from the last layer, $DP$, is the model’s definitive prediction for $z$.
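A matching sketch of this prediction pass is shown below. It reuses `compute_hidden_layer` and the per-layer parameters returned by the training sketch above; `alpha` must be the same constant used during training.

```python
def predict_d_tsk_fc(z, layers, alpha=0.03):
    """Sequential prediction for new samples z of shape (n, d) -- a sketch
    mirroring the training loop above."""
    z_dp = z
    y = None
    for Upsilon, RC, beta, Z in layers:
        H = compute_hidden_layer(z_dp, Upsilon, RC)
        y = H @ beta                     # this layer's prediction
        z_dp = z + alpha * y @ Z         # same stacking formula as in training
    return y                             # predicted class = argmax over the m columns
```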

This post is licensed under CC BY 4.0 by the author.