feat(scoring): use model size as direct multiplier
- docs/ranking_system.md  +77 -0
- src/core/scoring.py  +5 -13
docs/ranking_system.md
ADDED
@@ -0,0 +1,77 @@
# Device Ranking System

## Overview
The ranking system implements a multi-dimensional approach to evaluate and compare device performance across different aspects of LLM (GGUF) model runs.

## Scoring Algorithm

### Standard Benchmark Conditions
```python
PP_CONFIG = 512  # Standard prompt processing token count
TG_CONFIG = 128  # Standard token generation count

# Component Weights
TG_WEIGHT = 0.6  # Token generation weight (60%)
PP_WEIGHT = 0.4  # Prompt processing weight (40%)
```
- PP given 40% weight as it's a one-time cost per prompt
- TG given higher weight (60%) as it represents ongoing performance

### Quantization Quality Factors
```python
QUANT_TIERS = {
    "F16": 1.0,
    "F32": 1.0,
    "Q8": 0.8,
    "Q6": 0.6,
    "Q5": 0.5,
    "Q4": 0.4,
    "Q3": 0.3,
    "Q2": 0.2,
    "Q1": 0.1,
}
```

- Linear scale from 0.1 to 1.0 based on quantization level
- F16/F32 are considered 1.0 (this skews the results a bit towards quantization)

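The scoring code looks this tier up from the model's ID via `get_quantization_tier` (not part of this commit). As a rough illustration only, a minimal standalone sketch, assuming the tier label is embedded in the GGUF model ID (e.g. `Q4` in `...-Q4_K_M.gguf`); the helper name, regex, and fallback value here are hypothetical, not the repo's actual implementation:

```python
import re
from typing import Dict

QUANT_TIERS: Dict[str, float] = {
    "F16": 1.0, "F32": 1.0, "Q8": 0.8, "Q6": 0.6, "Q5": 0.5,
    "Q4": 0.4, "Q3": 0.3, "Q2": 0.2, "Q1": 0.1,
}

def quant_factor_for(model_id: str, default: float = 0.5) -> float:
    """Return the quality factor for the first quant label found in a model ID."""
    match = re.search(r"F16|F32|Q[1-8]", model_id.upper())
    return QUANT_TIERS[match.group(0)] if match else default

# e.g. quant_factor_for("llama-3.2-1b-q4_k_m.gguf") -> 0.4
#      quant_factor_for("phi-2-f16.gguf")           -> 1.0
```
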
### Performance Score Formula
The final performance score is calculated in three steps (a worked example follows the list):

1. **Base Performance**:
```
base_score = (TG_speed * TG_WEIGHT + PP_speed * PP_WEIGHT)
```

2. **Size and Quantization Adjustment**:
```
# Direct multiplication by model size (in billions of parameters)
performance_score = base_score * model_size * quant_factor
```
- Linear multiplier by model size

3. **Normalization**:
```
normalized_score = (performance_score / max_performance_score) * 100
```

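To make the steps concrete, a worked example with hypothetical numbers (a 3B-parameter model quantized to Q4, benchmarked at 20 t/s generation and 150 t/s prompt processing):

```python
TG_WEIGHT, PP_WEIGHT = 0.6, 0.4

tg_speed = 20.0     # token generation, tokens/s (hypothetical)
pp_speed = 150.0    # prompt processing, tokens/s (hypothetical)
model_size = 3.0    # billions of parameters
quant_factor = 0.4  # Q4 tier

base_score = tg_speed * TG_WEIGHT + pp_speed * PP_WEIGHT    # 12.0 + 60.0 = 72.0
performance_score = base_score * model_size * quant_factor  # 72.0 * 3.0 * 0.4 = 86.4

# Assuming the best device/model combination scored 432.0:
max_performance_score = 432.0
normalized_score = performance_score / max_performance_score * 100  # 20.0
```
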
### Filtering
- Only benchmarks matching the standard conditions are considered (see the sketch below):
  - PP_CONFIG (512) tokens for prompt processing
  - TG_CONFIG (128) tokens for token generation

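A minimal sketch of that filter, assuming the raw benchmark DataFrame carries the run configuration in columns named `PP Config` and `TG Config` (hypothetical names; the app's actual columns may differ):

```python
import pandas as pd

PP_CONFIG = 512  # standard prompt processing token count
TG_CONFIG = 128  # standard token generation count

def filter_standard_runs(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows benchmarked at the standard PP/TG token counts."""
    mask = (df["PP Config"] == PP_CONFIG) & (df["TG Config"] == TG_CONFIG)
    return df.loc[mask].copy()
```
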
## Data Aggregation Strategy

### Primary Grouping
- Groups data by `Normalized Device ID` and `Platform`
- Uses normalized device IDs to ensure consistent device identification across different submissions

```python
def normalize_device_id(device_info: dict) -> str:
    if device_info["systemName"].lower() == "ios":
        return f"iOS/{device_info['model']}"

    memory_tier = f"{device_info['totalMemory'] // (1024**3)}GB"
    return f"{device_info['brand']}/{device_info['model']}/{memory_tier}"
```
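For illustration, here is how a couple of hypothetical `device_info` payloads (field values made up) would be normalized by the function above:

```python
# Hypothetical inputs; only the fields used by normalize_device_id are shown.
ios_device = {"systemName": "iOS", "model": "iPhone15,2"}
android_device = {
    "systemName": "Android",
    "brand": "samsung",
    "model": "SM-S918B",
    "totalMemory": 8 * 1024**3,  # memory reported in bytes
}

print(normalize_device_id(ios_device))      # iOS/iPhone15,2
print(normalize_device_id(android_device))  # samsung/SM-S918B/8GB
```

Bucketing non-iOS devices by whole-GB memory tiers keeps submissions from the same model but different RAM configurations in separate groups.
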
src/core/scoring.py
CHANGED

```diff
@@ -20,8 +20,8 @@ def get_default_quant_tiers() -> Dict[str, float]:
         "Q6": 0.6,  # Still fancy
         "Q5": 0.5,  # The "medium rare" of quantization
         "Q4": 0.4,  # Gets the job done
-        "Q3": 0.3,  # Nice try
-        "Q2": 0.2,  # eh
+        "Q3": 0.3,  # Nice try
+        "Q2": 0.2,  # eh
         "Q1": 0.1,  # At this point, just use a Magic 8-Ball
     }
 
@@ -36,7 +36,6 @@ class StandardBenchmarkConditions:
     # Weights for different components in scoring
     TG_WEIGHT: float = 0.6  # Token generation weight
     PP_WEIGHT: float = 0.4  # Prompt processing weight
-    SIZE_BONUS_FACTOR: float = 0.2  # Bonus factor for model size
 
     # Quantization quality tiers
     QUANT_TIERS: Dict[str, float] = field(default_factory=get_default_quant_tiers)
@@ -83,7 +82,7 @@ def calculate_performance_score(df: pd.DataFrame) -> pd.DataFrame:
     This function computes a normalized performance score taking into account:
     - Token generation speed
     - Prompt processing speed
-    - Model size
+    - Model size (direct multiplier)
     - Quantization quality
 
     Only considers benchmarks that match the standard conditions:
@@ -114,31 +113,25 @@ def calculate_performance_score(df: pd.DataFrame) -> pd.DataFrame:
         df["quant_factor"] = df["Model ID"].apply(
             lambda x: get_quantization_tier(x, std)
         )
-        df["size_factor"] = df["Model Size"] / df["Model Size"].max()
         return df
 
     # Calculate base metrics (no normalization needed as we're using standard conditions)
     standard_df["normalized_tg"] = standard_df["Token Generation"]
     standard_df["normalized_pp"] = standard_df["Prompt Processing"]
 
-    # Model size factor (bonus for larger models)
-    standard_df["size_factor"] = (
-        standard_df["Model Size"] / standard_df["Model Size"].max()
-    )
-
     # Quantization quality factor
     standard_df["quant_factor"] = standard_df["Model ID"].apply(
         lambda x: get_quantization_tier(x, std)
     )
 
-    # Combined performance score
+    # Combined performance score using model size as direct multiplier
    standard_df["performance_score"] = (
         (
             standard_df["normalized_tg"] * std.TG_WEIGHT
             + standard_df["normalized_pp"] * std.PP_WEIGHT
         )
+        * standard_df["Model Size"]  # Direct size multiplier
         * standard_df["quant_factor"]  # Apply quantization penalty
-        * (1 + standard_df["size_factor"] * std.SIZE_BONUS_FACTOR)  # Apply size bonus
     )
 
     # Normalize final score to 0-100 range
@@ -157,7 +150,6 @@ def calculate_performance_score(df: pd.DataFrame) -> pd.DataFrame:
             "Model ID",
             "performance_score",
             "quant_factor",
-            "size_factor",
         ]
     ],
     on=["Device", "Platform", "Model ID"],
```