redbug

intro

The erlang tracing (the so-called "trace bifs", erlang:trace/3 and erlang:trace_pattern/3) are insanely powerful. For example, imagine this problem;

We have a system using 10,000 processes. Due to a bug, some rogue code occasionally calls a function with some garbage data. This eventually causes total system failure. We can not easily load code onto the system since it sits at a customer site.

Using the trace bifs, we can instruct the VM to set a watchpoint on the called function, a watchpoint that only triggers when the function is called with the bad value, and upon triggering inspects the call stack and tells us which piece of code issued the bad call.

Unfortunately, the interface to the trace bifs is geared towards power, and not is not particularly user friendly. A layer on top of the bifs is needed.

dbg

The standard OTP solution is called dbg. Alas, I must recommend against using this on live systems, since it is way too easy to get it wrong.

Try this for a laugh (it will kill the node).

5> dbg:tracer(),dbg:p(all,m).

Another problem with dbg is that it does not lift the level of abstraction enough (from that of the trace bifs).

Things look grim, what to do?

redbug aims to be a safe and simple interface to the erlang trace functionality, with some extra features thrown in.

The simplicity comes from exposing only the most useful (i.e. the bits I use often) parts of the API in a concise way.

The safety comes partly from the simplicity, but also from the way the redbug process works.

redbug supervises itself and will kill itself if it deems that it is in danger of overloading the node. It also delays the processing of the trace messages until after the trace is done.

The most interesting extra feature is probably the stack match spec. Answers the eternal question: "Who the * is calling my function with this * argument?"

Another potentially interesting feature is remote tracing; i.e. tracing on a node other than the one we run redbug on.

interface

The interface consists of the function start/2.

redbug:start(Trc,Opts)
Returns: {Procs,Funcs} | {argument_error,Err} | {syntax_error,Err}
          Procs: number of matched processes
          Funcs: number of matched functions
          Err: error descriptor
If the blocking option is given, start/2 will return a list of trace messages.


Trc: list('send'|'receive'|string(RTP))
RTP:  restricted trace pattern
  the RTP has the form: "<mfa> when <guards> -> <actions>"
  where <mfa> can be;
  "mod", "mod:fun", "mod:fun/3" or "mod:fun('_',atom,X)"
  <guard> is something like;
  "X==1" or "is_atom(A)"
  and <action> is;
  "return" and/or "stack" (separated by ";")

  E.g.
  ets:lookup(T,hostname) when is_integer(T) ->stack

Opts: list({Opt,Val})

  general opts:
time         (15000)       stop trace after this many ms
msgs         (10)          stop trace after this many msgs
target       (node())      node to trace on
arity        (false)       print arity instead of arg list
blocking     (false)       block start/2, return a list of messages
procs        (all)         (list of) Erlang process(es)
                           all|pid()|atom(RegName)|{pid,I2,I3}

  print-related opts
max_queue    (5000)        fail if internal queue gets this long
max_msg_size (50000)       fail if seeing a msg this big
buffered     (no)          buffer messages till end of trace
print_calls  (true)        print calls
print_file   (standard_io) print to this file
print_msec   (false)       print milliseconds on timestamps
print_depth  (999999)      formatting depth for "~P"
print_re     ("")          print only strings that match this

  trc file related opts
file         (none)        use a trc file based on this name
file_size    (1)           size of each trc file
file_count   (8)           number of trc files

examples

Trace on all functions in the module ets, timeout after 1 second.


6> redbug:start("ets",{time,1000}).
{38,136}

17:43:55 <{erlang,apply,2}> {ets,lookup,
                                 [ac_tab,{env,stdlib,shell_history_length}]}

Trace on ets:lookup, timeout after 1 second.


7> redbug:start("ets:lookup",{time,1000}).
{38,1}

17:43:55 <{erlang,apply,2}> {ets,lookup,
                                 [ac_tab,{env,stdlib,shell_history_length}]}

Trace on ets:lookup/2, match on part of the second argument, timeout after 1 second.


8> redbug:start("ets:lookup(_,{_,_,shell_saved_results})",{time,1000}).
{38,1}

17:45:27 <{erlang,apply,2}> {ets,lookup,
                                 [ac_tab,{env,stdlib,shell_saved_results}]}

Trace on calls to ets:lookup and its return values, timeout after 1 second, max two messages.


9> redbug:start("ets:lookup->return",[{msgs,2},{time,1000}]).      
{38,1}

18:06:26 <{erlang,apply,2}> {ets,lookup,
                                 [ac_tab,{env,stdlib,shell_history_length}]}
                
18:06:26 <{erlang,apply,2}> {ets,lookup,2} -> []

Trace on ets:lookup/2, match first argument, guard on the second argument, timeout after 1 second, max two messages.


10> redbug:start("ets:lookup(ac_tab,X)when not is_tuple(X)->return",[{msgs,2},{time,10000}]).
{38,1}

18:09:27 <{erlang,apply,2}> {ets,lookup,[ac_tab,[]]}
18:09:27 <{erlang,apply,2}> {ets,lookup,2} -> []

Trace on calls to ets:lookup and its return values, timeout after 1 second, max two messages. Display call stack.


11> redbug:start("ets:lookup->stack,return",[{msgs,2},{time,1000}]).
{38,1}

18:05:02 <{erlang,apply,2}> {ets,lookup,
                                 [ac_tab,{env,stdlib,shell_history_length}]}
  shell:eval_loop/3 
  shell:eval_exprs/7 
  shell:exprs/7 
  shell:result_will_be_saved/0 
  shell:get_history_and_results/0 
  shell:get_env/2 
                
18:05:02 <{erlang,apply,2}> {ets,lookup,2} -> []

misc

The call stack displayed by redbug is really a call return stack; i.e. only functions that will be returned to are shown. This is a feature of the emulator (tail call optimization), and cannot be circumvented in a safe way (as far as I know).

redbug has the idiotic/awesome feature that it can be run from a Unix shell. actually pretty handy if you're working against an embedded system.


bash$ ./priv/bin/redbug foo@localhost "ets:lookup"

22:38:52 <{erlang,apply,2}> {ets,lookup,
                                 [ac_tab,{env,stdlib,shell_history_length}]}
redbug done, timeout - 7