# results.feature
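#
# Determinism tests for the llama.cpp server completion endpoint: same-seed
# requests must match, different-seed requests must differ, and results must
# not depend on how requests are batched across slots.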

@llama.cpp
@results
Feature: Results

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And a model file test-model-00001-of-00003.gguf
    And 128 as batch size
    And 1024 KV cache size
    And 128 max tokens to predict
    And continuous batching
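
  # Identical prompts submitted concurrently with the same seed should yield
  # identical predictions, whether they are served by one slot or two.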
  Scenario Outline: consistent results with same seed
    Given <n_slots> slots
    Then the server is starting
    Then the server is healthy
    Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And all slots are idle
    Then all predictions are equal

    Examples:
      | n_slots |
      | 1       |
      | 2       |
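
  # Identical prompts submitted with different seeds should each yield a
  # different prediction.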
  Scenario Outline: different results with different seed
    Given <n_slots> slots
    Then the server is starting
    Then the server is healthy
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And all slots are idle
    Then all predictions are different

    Examples:
      | n_slots |
      | 1       |
      | 2       |
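
  # Batching should not change results: a prompt completed on its own and the
  # same prompt completed as one of <n_parallel> concurrent requests must
  # produce equal predictions for the same seed.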
  Scenario Outline: consistent results with same seed and varying batch size
    Given 4 slots
    And <temp> temperature
    # And 0 as draft
    Then the server is starting
    Then the server is healthy
    Given 1 prompts "Write a very long story about AI." with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle
    Given <n_parallel> prompts "Write a very long story about AI." with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle
    Then all predictions are equal

    Examples:
      | n_parallel | temp |
      | 1          | 0.0  |
      | 2          | 0.0  |
      | 4          | 0.0  |
      | 1          | 1.0  |
      # FIXME: These tests fail on master. The problem seems to be the unified KV cache.
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574 .
      # | 2          | 1.0  |
      # | 4          | 1.0  |