How do I measure the runtime of my simulation?
Use the Ruby_cycles metric to measure the runtime of the simulated system. This is contained in the .stats file created when dumping Ruby statistics with the dump-stats command.
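For example, from the Simics console (the object name `ruby0` and the output filename are illustrative; your Ruby module object may be named differently):

```
simics> ruby0.dump-stats my_run.stats
```

Afterwards, search the generated .stats file for the Ruby_cycles line.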
My simulation seems frozen just after loading Ruby/Opal. What's wrong?
(AKA: "My simulation seems slow just after loading Ruby/Opal.") This is normal. Depending on the protocol choice, configuration, and workload, running Simics with Ruby slows simulation down by about 125x (measured). We believe this is due to inefficiencies in Simics' stall processor models. Adding Opal worsens the slowdown to about 300x (most of the additional overhead goes to SIM_continue API calls).
What ISAs are supported by GEMS?
GEMS is intended to support the UltraSPARC III+'s ISA, which is a superset of SPARCv9. Opal is tightly tied to this ISA implementation. Ruby is less coupled to the SPARC ISA and can be used with x86-based targets with some modification (there is a patch for GEMS 1.2 in the downloads directory).
Why does the Ruby ''instructions_per_cycle'' metric differ from the CPI I get in Opal's output?
The OPAL_RUBY_MULTIPLIER is simply a frequency multiplier between Ruby's and Opal's clocks. If you set it to 1, Opal and Ruby run at the same frequency; with higher values, Opal runs at a proportionally faster frequency. Because the two metrics are computed in different clock domains, Opal's CPI and Ruby's instructions_per_cycle will disagree by this factor whenever the multiplier is not 1. The multiplier was added as a performance enhancement: the assumption is that Opal needs a high frequency to model a fast out-of-order processor, while Ruby can use a lower frequency because it models the relatively slow memory system. Since Ruby is an event-queue-driven simulator, a slower Ruby frequency means fewer 'wakeup' cycles and thus less simulation time. However, when using a non-unit multiplier, you need to remember this relation when analyzing the stats.
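As a concrete sketch of the relationship (all numbers below are made up for illustration): with a multiplier of 4, each Ruby cycle corresponds to 4 Opal cycles over the same wall-clock span, so the two metrics differ exactly by the multiplier.

```python
# Illustrative arithmetic only; parameter and counter values are invented.
OPAL_RUBY_MULTIPLIER = 4          # Opal runs at 4x Ruby's frequency

ruby_cycles = 1_000_000           # cycles as counted by Ruby
instructions = 2_000_000          # instructions retired by Opal

opal_cycles = ruby_cycles * OPAL_RUBY_MULTIPLIER   # same span of simulated time
ruby_ipc = instructions / ruby_cycles              # Ruby's instructions_per_cycle
opal_cpi = opal_cycles / instructions              # CPI as Opal reports it

# The two metrics disagree by exactly the frequency multiplier:
assert abs(opal_cpi * ruby_ipc - OPAL_RUBY_MULTIPLIER) < 1e-9
```

So to compare the two numbers, divide Opal's CPI by the multiplier (or equivalently, convert both to the same clock domain first).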
What memory consistency model does GEMS support?
Opal was written more to get memory references out in parallel than to be an accurate out-of-order processor microarchitecture. It issues loads and stores out of order, but uses Simics in such a way that any violations of SC functionality are detected as "deviations," causing Opal's state to be reloaded. In the release there are no constraints on when loads and stores can execute; the only constraint is the load-store dependence checking within the LSQ (a violation results in a pipeline flush and re-execution of the violating load). Stores and loads execute whenever their source operands are ready and the address is calculated, and violations are detected at retirement when Opal checks register values against Simics. Any deviation results in a pipeline flush and restart at the instruction following the violating instruction. Currently Opal performs store permission prefetching when the address is known but the source data value has not yet been computed. Speculative loads are permitted, and the appropriate LSQ and MSHR entries are squashed (and the pipeline flushed) on misspeculation. The MSHR merges loads and stores to the same line address into the same entry, which is a problem if multiple outstanding stores to the same address occur, because the memory system will not see separate requests, just the first one. The LSQ is not snooped in any way, so there is no mechanism to inform a processor of consistency violations.
How can I tell what the latency is between two points in Ruby's interconnect?
If you set PRINT_TOPOLOGY to true, Ruby prints the latencies it calculates for the interconnect.
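For example, in $GEMS/ruby/config/rubyconfig.defaults (the `PARAMETER: value` form shown is the style used by the defaults file; check your copy):

```
PRINT_TOPOLOGY: true
```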
Why am I getting undefined references to SIM_* functions at link-time?
Part of the Ruby build process is to build the random tester, which links against all of Ruby's code to form a standalone executable. The SIM_* functions are provided by Simics, and are only callable when running Ruby as a Simics module. When linking against the tester, we provide dummies of these SIM_* API calls in $GEMS/ruby/simics/simics_api_dummy.c. Just add definitions for your missing functions there, and everything should compile and link cleanly.
How do I set the memory latency?
The latency returned by the Directory/Memory controllers is primarily determined by the parameter MEMORY_RESPONSE_LATENCY_MINUS_2, found in $GEMS/ruby/config/rubyconfig.defaults. A random number between 0 and 4 is generated on each response and added to this parameter; since the random addend averages 2 cycles, the expected latency is the parameter value plus 2 (hence the name). This randomness is used to address workload variability, as discussed in Alameldeen's 2003 HPCA paper.
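A minimal Python sketch of the calculation the FAQ describes (the parameter value is illustrative; the real logic lives in Ruby's C++ memory controller code):

```python
import random

# Illustrative value, in Ruby cycles; set in rubyconfig.defaults.
MEMORY_RESPONSE_LATENCY_MINUS_2 = 78

def memory_response_latency():
    """Base latency plus a uniform random 0..4 cycles per response.

    The random addend averages 2 cycles, which is why the parameter
    name carries the _MINUS_2 suffix: expected latency = parameter + 2.
    """
    return MEMORY_RESPONSE_LATENCY_MINUS_2 + random.randint(0, 4)
```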
How do I change the L1 cache access latency?
By default, the hit latency for L1 caches is always 1 cycle and cannot be changed for the *SMP* protocols. For the CMP protocols, the L1D hit latency can be changed through two non-intuitive parameters. First, set REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH to true; this causes all data accesses to be issued from the Ruby Sequencer to the L1 cache controller. Then set SEQUENCER_TO_CONTROLLER_LATENCY to the desired L1 hit latency.
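Putting the two settings together in rubyconfig.defaults (the 3-cycle hit latency is illustrative):

```
REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH: true
SEQUENCER_TO_CONTROLLER_LATENCY: 3
```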
How do I change the L2 cache access latency?
For the SMP protocols, the L2 hit latency is determined by SEQUENCER_TO_CONTROLLER_LATENCY, and REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH must be set to false. For the CMP protocols, the L2 access latency is controlled by the parameter used in the SLICC specification file, usually L2_RESPONSE_LATENCY.
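For an SMP protocol, for example, the rubyconfig.defaults settings would look like this (the 10-cycle value is illustrative):

```
REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH: false
SEQUENCER_TO_CONTROLLER_LATENCY: 10
```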
Why do I see no difference when I change parameter XX in rubyconfig.defaults?
Many of the parameters are only used by certain protocols. For example, many of the SMP protocols use the CACHE_RESPONSE_LATENCY when responding to a forwarded request. But the CMP protocols often use other parameters such as L1_RESPONSE_LATENCY and L2_RESPONSE_LATENCY.
Many of the parameters apply to all protocols. For example, the g_SEQUENCER_OUTSTANDING_REQUESTS parameter is used by the Ruby Sequencer ($GEMS/ruby/system/Sequencer.C) and applies to any protocol (though raising this parameter does nothing without Opal's ability to generate multiple outstanding requests per processor).
Why doesn't my simulation complete when running with Ruby?
See the slowdown numbers in the tutorial slides for an idea of how long things should take. From Dan Gibson:
As far as the question about Ruby's performance overall (which seems to come up rather frequently), I'll give some general remarks. In general, Ruby's performance degrades with:
- Increases in processor count
- More complex networks, like DNUCA
- Longer memory latencies, and more memories
- Larger target address spaces
- More complicated protocols
- More contention in the target workloads
- Lots of other things

As you might expect, Ruby's performance also depends on whether Opal is used, what version of Simics is used, what compiler was used, what the host's performance is, and lots more. I'm not able to say with any certainty, for a given combination of host machine, target configuration, memory sizes, etc., what the expected slowdown of Ruby will be, except that it will almost always be significant. Here are a few things to do that sometimes improve simulator performance:
1) Reduce memory and cache latencies
2) Look into the Simics command set-memory-limit to reduce Simics' footprint
3) Use simple networks and protocols
4) Be careful to only simulate the important regions of code -- do not include thread creation / destruction, process forking, etc.
5) Before loading Ruby, do a dry-run of your algorithm to warm up the TLBs and the OS file cache, if necessary.