Service Level Agreement management with LUA scripting

Introduction

The SLA management feature is available in the LoriotPro Extended Edition. With this set of capabilities, LoriotPro helps you check that the current quality of your information system (availability and performance counters) meets a predefined level of service, an SLA.

A quality of service indicator is expressed as a percentage of something (availability, performance) over a defined time period, such as a day or a week. For example, 99% availability over a one-week period means that the host was unreachable for a cumulative time of at most 1% of the week, that is 6,048 seconds (about 1 hour 41 minutes). Likewise, 99% performance over a one-week period means that the response time exceeded the value set by the agreement for at most 6,048 seconds.
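The relation between a target percentage and the tolerated downtime can be checked with a few lines of LUA. This is a minimal sketch for a stand-alone LUA interpreter (5.1 or later); the function name downtime_budget is illustrative and is not part of LoriotPro.

```lua
-- Allowed cumulative downtime, in seconds, for an SLA target expressed
-- in percent over a period expressed in days.
function downtime_budget(target_percent, period_days)
  local period_seconds = period_days * 24 * 60 * 60
  return period_seconds * (100 - target_percent) / 100
end

print(downtime_budget(99, 7))  -- budget for one week at 99 %
print(downtime_budget(99, 1))  -- budget for one day at 99 %
```

For a 99% target the budget is 6,048 seconds over a week and 864 seconds over a day.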

The SLA support in LoriotPro provides predefined SLA reports. You can also define your own reports and automate report generation with scripts written in the LUA language embedded in the LoriotPro Extended Edition.

One of the goals of LoriotPro is to monitor the availability of the hosts defined in the directory. This task is performed by an embedded LoriotPro module called the poller process. The poller process sends packets (icmp ping or snmp ping) at regular intervals to each monitored host.

If the host configuration allows SLA data collection, the results of this continuous polling are stored in a local database. This database records two basic pieces of information: whether the host answers our requests and how long each answer takes. These two values let us check whether the host availability and performance meet a predefined quality of service, our service level agreement.

The SLA database consists of simple daily text files that contain the value returned by each ping request together with its timestamp. Database file names are also date stamped to ease the analysis of data over a time period.

The LUA scripting language and the dedicated SLA script functions let you exploit the database and create your own SLA reports.

Starting the SLA data collection

The SLA data collection is started in the host configuration screen.
Select a host in the directory and click the properties icon, or select the properties option from the contextual menu. In the list of tabs, select the SLA tab and check the "Active SLA" checkbox. That's it.

Service level agreement activation

SLA Database architecture

If you plan to use the data and the script functions for SLA management, you should understand the structure of the SLA database maintained by LoriotPro.

LoriotPro uses a directory structure to store the SLA database files. The SLA directory is located in the /bin directory of the LoriotPro installation path.

sla files

By default LoriotPro is installed in:

C:\Program Files\LUTEUS\LoriotPro V4\

The SLA database files are then stored in \bin\SLA:

C:\Program Files\LUTEUS\LoriotPro V4\bin\SLA

The first subdirectory level identifies the LoriotPro installation that performed the data collection. Every installed copy of LoriotPro receives a unique ID.

The level underneath contains subdirectories named after the IP addresses of the hosts that have SLA collection activated. SLA activation is performed host by host, in each host's advanced parameters. By default the poller process of LoriotPro does not store SLA data.

In each directory named after an IP address, LoriotPro creates the database files, one for each day of data collection.

These directories can also include other directories named after the port number of an application or a URL. This capability will be used in future development.

sla directory

The file structure is always the same, which allows fast and simple data analysis. Files contain only data, with no host reference.

SLA database file structure

The database file structure follows strict rules.

File name coding

Database files are named after the current date, one file per day of the year. These files are plain text files.

 Year_Month_Day.txt (Year 4 digits, Month 2 digits, Day 2 digits)

sla file

In the example above, the file is the SLA database file of 22 October 2005 for the host with IP address 123.1.1.1, polled with icmp/snmp by the LoriotPro with ID 1007.
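Given the directory layout and the file naming rule described above, the path of a daily database file can be built as follows. This is a minimal sketch in stand-alone LUA (5.1 or later); the function name sla_file_path and its arguments are illustrative and are not part of the LoriotPro API.

```lua
-- Build the path of a daily SLA database file.
-- Layout taken from the text: <SLA root>/<LoriotPro ID>/<host IP>/<Year_Month_Day>.txt
function sla_file_path(sla_root, loriot_id, host_ip, year, month, day)
  -- Year on 4 digits, month and day on 2 digits
  local file_name = string.format("%04d_%02d_%02d.txt", year, month, day)
  return table.concat({ sla_root, tostring(loriot_id), host_ip, file_name }, "/")
end

print(sla_file_path("SLA", 1007, "123.1.1.1", 2005, 10, 22))
-- SLA/1007/123.1.1.1/2005_10_22.txt
```

The forward slash is used as separator because the LoriotPro scripts in this document build paths the same way.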

SLA database files description

A file is composed of successive lines; each line has the following format:

timestamp;polling type;Response time or status information 

1110629680;1;start
1110629680;2;start
1110629697;2;16
1110629714;2;15
1110629733;2;0
1110629750;2;0
1110629767;2;0
1110629783;2;0
1110629800;2;0
1110629817;2;15
1110629833;2;0

timestamp is the time of the sample (number of seconds elapsed since 1970).

Polling type defines what kind of packet or request is used for the polling.

Number   Polling type
1        Icmp - standard Ping
2        Snmp - A simple snmp request on sysname
3        Tcp - A request on a listening TCP application
4        Udp - A request on a listening UDP application
5        url - A URL access
..

If both PING and SNMP are used for polling, both types 1 and 2 may appear in the file.

Response time or status information

If the polled host answers, the response time in milliseconds is stored.

Otherwise a status message may be present instead.

Label            Description
start            Notifies that the SLA collection starts (or program started 'mini -bk')
stop             Notifies that the SLA collection stops
stop_polling     The global polling process of LoriotPro is stopped
start_polling    The global polling process of LoriotPro is started
stop_repair      (mini -bk) end of a repair period
start_repair     (mini -bk) beginning of a repair period
stop_loriot      (mini -bk) LoriotPro is stopped
start_loriot     (mini -bk) LoriotPro starts
-1               There was no answer from the host
35               Example of a response time in milliseconds
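The line format described above is simple enough to be parsed with one pattern. The following is a minimal sketch in stand-alone LUA (5.1 or later, for string.match); the function name parse_sla_line and the field names of the returned table are illustrative choices, not part of the lpsla library.

```lua
-- Parse one "timestamp;polling type;value" line of an SLA database file.
-- Returns a table describing the entry, or nil if the line does not match.
function parse_sla_line(line)
  -- %S+ stops at the first blank, so a trailing annotation is ignored
  local ts, ptype, value = string.match(line, "^(%d+);(%d+);(%S+)")
  if not ts then return nil end
  local record = { timestamp = tonumber(ts), polling_type = tonumber(ptype) }
  local n = tonumber(value)
  if n == nil then
    record.kind = "status"            -- start, stop, stop_polling, ...
    record.status = string.lower(value)
  elseif n < 0 then
    record.kind = "no_answer"         -- -1: the host did not answer
  else
    record.kind = "response"          -- response time in milliseconds
    record.rtt_ms = n
  end
  return record
end

local r = parse_sla_line("1110629697;2;16")
print(r.kind, r.polling_type, r.rtt_ms)
```

Status labels are lowercased so that a capitalised label is handled like the lowercase form found in the sample files.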

How to estimate the polling interval

1129991441;1;start
1129991441;2;start
1129991450;1;-1
1129991450;2;-1
1129991466;1;-1
1129991466;2;-1
1129991482;1;-1
1129991482;2;-1
1129991498;1;-1
1129991498;2;-1
1129991514;1;-1

In the example above, the SLA collection starts for types 1 and 2 (icmp ping and snmp ping) and the polling interval is:

1129991466 - 1129991450 = 16 seconds

Here we take the timestamps of lines 5 and 4, which are genuine polling entries, to find the polling interval. The timestamps of status entries such as start and stop are not tied to the polling period and cannot be used to calculate the polling interval.
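The same estimate can be automated. The sketch below, written for a stand-alone LUA interpreter (5.1 or later), compares consecutive polling entries of one polling type, skips status entries, and returns the median difference so that a few irregular samples do not distort the result. The function name estimate_polling_interval and the choice of the median are assumptions made for illustration.

```lua
-- Estimate the polling interval (seconds) for one polling type.
-- lines: list of "timestamp;polling type;value" strings.
function estimate_polling_interval(lines, polling_type)
  local deltas, previous = {}, nil
  for _, line in ipairs(lines) do
    local ts, ptype, value = string.match(line, "^(%d+);(%d+);(%S+)")
    if ts and tonumber(ptype) == polling_type then
      if tonumber(value) then          -- real polling entry (response time or -1)
        if previous then
          table.insert(deltas, tonumber(ts) - previous)
        end
        previous = tonumber(ts)
      else                             -- status entry: not tied to the polling period
        previous = nil
      end
    end
  end
  if #deltas == 0 then return nil end
  table.sort(deltas)
  return deltas[math.ceil(#deltas / 2)]  -- median (lower middle for even counts)
end

local sample = {
  "1129991441;1;start", "1129991441;2;start",
  "1129991450;1;-1", "1129991450;2;-1",
  "1129991466;1;-1", "1129991466;2;-1",
  "1129991482;1;-1", "1129991482;2;-1",
}
print(estimate_polling_interval(sample, 1))  -- 16
```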

In the next example the host stops responding to snmp polling, so the poller process switches to icmp polling (double polling is activated in the host properties).

Double polling is one of the poller features. Polling can use either snmp ping or icmp ping. If both polling methods are turned on, only snmp polling is used, but if the host fails to answer snmp polling then icmp polling is used. A host's snmp agent can be stopped, so that snmp requests are no longer answered, while the host itself is still working and available.

The host is answering snmp ping requests; the polling type is 2.

1110103926;1;start
1110103926;2;start
1110103943;2;0
1110103960;2;15
1110103979;2;32
1110103996;2;0
1110104013;2;0
1110104029;2;0
1110104046;2;15
1110104062;2;0
1110104079;2;0

The host stops answering snmp polling (2) but still answers ping polling (1).

If the administrator stops the SLA collection for this host, a stop status message is recorded.

1129991866;2;-1
1129991882;1;-1
1129991882;2;-1
1129991895;1;stop
1129991895;2;stop
1129991920;1;start
1129991920;2;start

Abnormal collection, holes in the data collection

If records are missing for some time intervals, a process crash is the probable cause. Restart the SLA collection for this host.

1110103926;1;start
1110103926;2;start
1110103943;2;0
1110103960;2;15
1110103979;2;32
1110103996;2;0
1110104013;2;0
1110104029;2;0
1110104046;2;15
1110104062;2;0
1110104079;2;0

In this example the SLA collection starts at 1110103926.

The polling interval is 1110103960 - 1110103943 = 17 seconds, but the following values show that the polling interval varies. LoriotPro may be unable to keep up with the polling because of an overload.

Abnormal termination of LoriotPro

If LoriotPro stops because of a system crash, the database file is not closed correctly.

1110103926;1;start
1110103926;2;start
1110103943;2;0
1110103960;2;15
1110103979;2;32
1110103996;2;0
1110104013;2;0
1110104029;2;0
1110104046;2;15
1110104330;2;0
1110104348;2;0
1110104365;2;0
1110104381;2;0
1110104397;2;0
1110104415;2;15
1110104432;2;0
1110105908;1;start
1110105908;2;start
1110105925;2;16

In the example above there is a hole in the collection between 1110104432 and 1110105908 (start).
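Holes of this kind can be located by comparing the timestamps of consecutive lines against a maximum tolerated gap. The following is a minimal sketch in stand-alone LUA (5.1 or later); the function name find_collection_gaps and the max_gap parameter are illustrative, and choosing a threshold of about three polling intervals is only a suggestion.

```lua
-- Report holes in the collection: pairs of consecutive entries whose
-- timestamps are more than max_gap seconds apart.
function find_collection_gaps(lines, max_gap)
  local gaps, previous = {}, nil
  for _, line in ipairs(lines) do
    local ts = string.match(line, "^(%d+);%d+;%S+")
    if ts then
      ts = tonumber(ts)
      if previous and ts - previous > max_gap then
        table.insert(gaps, { from = previous, to = ts, seconds = ts - previous })
      end
      previous = ts
    end
  end
  return gaps
end

local sample = {
  "1110104415;2;15", "1110104432;2;0",
  "1110105908;1;start", "1110105908;2;start", "1110105925;2;16",
}
for _, gap in ipairs(find_collection_gaps(sample, 51)) do
  print(gap.from, gap.to, gap.seconds)
end
```

Applied to the complete file shown above with a 51 second threshold, the same function also reports the 284 second gap between 1110104046 and 1110104330.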

The polling process can be stopped

sla polling

In this case the stop_polling and start_polling status entries are recorded.

1129993499;2;-1
1129993507;1;stop_polling
1129993507;2;stop_polling
1129993511;1;start_polling
1129993511;2;start_polling
1129993515;1;-1

The double polling can be stopped in the host properties

Example with one polling type stopped

1129993675;1;-1
1129993675;2;-1
1129993677;2;stop
1129993691;1;-1
1129993707;1;-1
1129993723;1;-1

Summary

The SLA database file structure is more complex when double polling (ping and SNMP) is turned on. In the following example, various actions performed in LoriotPro stop and start the SLA data collection.

1130404775;1;start
1130404775;2;start
1130404782;2;203
1130404795;1;stop_polling The global icmp polling is stopped
1130404795;2;stop_polling The global snmp polling is stopped
1130404796;1;start_polling The global icmp polling is started
1130404796;2;start_polling The global snmp polling is started
1130404803;2;0 Not usable value (host is responding)
1130404824;2;0
1130404845;2;0
1130404866;2;0
1130404907;1;start The SLA is started on icmp
1130404907;2;start The SLA is started on snmp
1130404920;2;250 not usable value
1130404941;2;0
1130404962;2;0
1130405362;1;start_loriot
1130405362;2;start_loriot
1130405383;2;781
1130405404;1;stop_repair
1130405404;2;stop_repair
1130405460;1;start_repair
1130405460;2;start_repair
1130405460;1;782 Not usable value
1130405460;2;0
1130405481;2;15
1130405502;2;0
1130405524;2;-1 no response from host
1130405545;2;0
1130405576;2;0
1130405731;1;start_loriot LoriotPro start (icmp)
1130405731;2;start_loriot LoriotPro start (snmp)
1130405752;2;16 Not usable value
1130405763;1;stop The icmp polling is stopped
1130405773;2;0
1130405781;1;start The icmp polling is started
1130405794;2;16 Not usable value
1130405808;1;stop The icmp polling is stopped
1130405815;2;0
1130405821;2;stop_repair The host snmp polling is stopped due to a maintenance period
1130405880;2;start_repair The host snmp polling is started after a maintenance period
1130405880;2;0 Not usable value
1130405901;2;0
1130405925;2;16
abnormal termination
1130406241;2;start_loriot LoriotPro start (snmp)
1130406271;2;954 Not usable value
1130406291;1;stop_loriot LoriotPro is stopping (end of icmp polling)
1130406291;2;stop_loriot LoriotPro is stopping (end of snmp polling) 

Remark: After each start, the first value recorded is not usable because the duration of the interruption is arbitrary.
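The remark above can be turned into a filter that drops the first value recorded after any start-type entry. This is a minimal sketch in stand-alone LUA (5.1 or later); the function name usable_samples is illustrative, and applying the rule separately to each polling type is an assumption.

```lua
-- Keep only the usable samples of one polling type: the first value that
-- follows a start, start_polling, start_repair or start_loriot entry is dropped.
function usable_samples(lines, polling_type)
  local samples, skip_next = {}, false
  for _, line in ipairs(lines) do
    local ts, ptype, value = string.match(line, "^(%d+);(%d+);(%S+)")
    if ts and tonumber(ptype) == polling_type then
      local n = tonumber(value)
      if n == nil then
        -- status entry: arm the flag on any label beginning with "start"
        if string.find(string.lower(value), "^start") then skip_next = true end
      elseif skip_next then
        skip_next = false              -- discard the first value after a start
      else
        table.insert(samples, { timestamp = tonumber(ts), value = n })
      end
    end
  end
  return samples
end

local sample = {
  "1130404775;2;start", "1130404782;2;203",
  "1130404795;2;stop_polling", "1130404796;2;start_polling",
  "1130404803;2;0", "1130404824;2;0", "1130404845;2;0",
}
print(#usable_samples(sample, 2))  -- 2
```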

Using LUA script functions to exploit the SLA database

Introduction

The LoriotPro embedded LUA script language can be used to exploit the data stored in the SLA database.

Among the functions provided, you will find:

A function that lists the LoriotPro installations that are collecting SLA data.

A function that lists the hosts that have SLA data.

A function that computes the SLA indicators over a time period.

The functions are stored in the lpsla library, which must be loaded in every LUA script that uses them.

V400 b138 SP0-cf, 31 May 2006:
ADD lua package: sla library

The lua_lp_sla.dll file is the library; it must be loaded in each LUA script file that uses the SLA functions.

if (lp.IsDebugMode()==1) then
lib,init=lp.LoadLibrary(lp.GetPath().."/lua_lp_slad.dll","libinit");
else
lib,init=lp.LoadLibrary(lp.GetPath().."/lua_lp_sla.dll","libinit");
end

List of SLA functions available in the LUA script language

number=lpsla.GetLoriotProIDList('array');

This function gets the list of the LoriotPro installations involved in SLA data collection. Each LoriotPro installation is identified by its unique ID.

'array' : An array of the available LoriotPro IDs.

number : The number of directories available.

array[0] .. array[number-1] : The list itself.

number=lpsla.GetSLAList('LoriotProID','array');

This function provides the list of hosts, by IP address, that have SLA data available for a specific LoriotPro ID.

'LoriotProID' : An ID (the ID is defined in the license information file /bin/licence.ini).

'array' : An array with the list of host IP addresses available for this LoriotPro ID.

number : The number of available hosts with the SLA feature activated.

array[0] .. array[number-1] : The list itself.

value=lpsla.Compute('id','sla_rep',Syear,Smonth,Sday,Eyear,Emonth,Eday,STime,ETime,RTT_Threshold,Avaibility,Performance,'array')

This function calculates the service level reached over a time period. The calculated values are returned in a table.

'id' : A LoriotPro ID

'sla_rep' : The directory where the database files are stored (ID/SLA)

Syear : The starting year

Smonth : The starting month (1 - 12)

Sday : The starting day (1 - 31)

Eyear : The ending year

Emonth : The ending month

Eday : The ending day

STime : The timestamp of the beginning time (OS binary format: os.time{year=2006,month=5,day=30,hour=0})

ETime : The timestamp of the ending time (OS binary format: os.time{year=2006,month=6,day=30,hour=0})

RTT_Threshold : The limit defined by the agreement for the response time

Avaibility : The rate limit defined for the availability

Performance : The performance limit expectation

'array' : An array of calculated data
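STime and ETime are plain os.time values, so the length of the analysed window is their difference in seconds. The sketch below, for a stand-alone LUA interpreter (5.1 or later), derives from it the number of samples one would expect for a given polling interval. The function name expected_samples is illustrative, and the rule of one sample per polling interval is an assumption; the way lpsla.Compute actually derives its expected count is not documented here.

```lua
-- Number of samples expected between two timestamps for a polling interval
-- given in seconds (assumption: exactly one sample per interval).
function expected_samples(start_time, end_time, polling_interval)
  return math.floor((end_time - start_time) / polling_interval)
end

-- STime and ETime built as in the parameter description above
local stime = os.time{year=2006, month=1, day=10, hour=0}
local etime = os.time{year=2006, month=1, day=11, hour=0}
print(expected_samples(stime, etime, 16))
```

For a full day of 86,400 seconds and a 16 second interval this gives 5,400 expected samples.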

The array has the following structure:

Return value : Description

array.ip : The IP address of the host

array.name : The host name

array.polling_type : The polling type (1 = icmp, 2 = snmp)

array.periode : The effective collection over the time range, in percent. Warning: if double polling is set, this number can be higher than 100%.

array.avaibility : The percentage of good responses over the time range (taking into account only the period for which results exist).

array.performance : The percentage of correct response times over the time range. RTT_Threshold is used to decide whether a response time is correct (under the threshold) or bad. Percentage = percentage of responses < RTT_Threshold.

array.total_collected : Number of samples in the database over this time range.

array.total_waited : Number of samples that should have been collected over this time range. Warning: when double polling is set, the number of collected samples can be higher than this expected maximum.
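To make the meaning of these fields concrete, the sketch below computes comparable indicators from a plain list of samples. It is written for a stand-alone LUA interpreter (5.1 or later) and is not the lpsla.Compute implementation: the function name sla_indicators is illustrative, and the formulas (availability and performance both divided by the number of collected samples, periode as collected over expected) are assumptions drawn from the descriptions above.

```lua
-- samples: list of numbers, a response time in ms or -1 for "no answer".
-- Assumed formulas:
--   periode     = collected / expected
--   avaibility  = answered  / collected
--   performance = answered under the threshold / collected
function sla_indicators(samples, rtt_threshold, expected_count)
  local collected, answered, fast = #samples, 0, 0
  if collected == 0 or expected_count == 0 then return nil end
  for _, rtt in ipairs(samples) do
    if rtt >= 0 then
      answered = answered + 1
      if rtt < rtt_threshold then fast = fast + 1 end
    end
  end
  return {
    periode = 100 * collected / expected_count,
    avaibility = 100 * answered / collected,
    performance = 100 * fast / collected,
    total_collected = collected,
    total_waited = expected_count,
  }
end

local result = sla_indicators({ 0, 15, 60, -1 }, 50, 5)
print(result.periode, result.avaibility, result.performance)
```

With the four samples above, a 50 ms threshold and five expected samples, the indicators are 80% for periode, 75% for avaibility and 50% for performance.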

Example of LUA script that performs the SLA calculation

-- sample

-- load the SLA library (debug or release build)
if (lp.IsDebugMode()==1) then
  lib,init=lp.LoadLibrary(lp.GetPath().."/lua_lp_slad.dll","libinit");
else
  lib,init=lp.LoadLibrary(lp.GetPath().."/lua_lp_sla.dll","libinit");
end

if (lib) then
  init();

  -- list the LoriotPro IDs that have SLA data; the result is stored in the array "a"
  k=lpsla.GetLoriotProIDList("a");
  for l=0,k-1 do
    lp.Print(a[l]," LoriotPro ID \n");

    -- list the hosts of this LoriotPro ID; the result is stored in the array "aa"
    i=lpsla.GetSLAList(a[l],"aa");
    if i then
      for j=0,i-1 do
        lp.Print("\t",aa[j]," SLA \n");

        --Compute('id' ,'sla_rep', Syear,Smonth, Sday, Eyear, Emonth, Eday, STime, ETime, RTT_Threshold, Avaibility, Performance, 'array')
        if lpsla.Compute(a[l],aa[j],2005,5,1,2006,6,30,os.time{year=2005,month=5,day=1,hour=0},os.time{year=2006,month=6,day=30,hour=0},50,90,90,'array') then
          lp.Print("\t\tip : ",array.ip,"\n");
          lp.Print("\t\tname : ",array.name,"\n");
          lp.Print("\t\tpolling_type : ",array.polling_type,"\n");
          lp.Print("\t\tperiode : ",array.periode,"%\n");
          lp.Print("\t\tavaibility : ",array.avaibility,"%\n");
          lp.Print("\t\tperformance : ",array.performance,"%\n");
          lp.Print("\t\tgood_polling : ",array.good_polling,"\n");
          lp.Print("\t\ttotal_collected : ",array.total_collected,"\n");
          lp.Print("\t\ttotal_waited : ",array.total_waited,"\n");
        end
      end
    end
  end
end

To run this script on a set of hosts, use the Host Bulk configuration Plugin.

 

calling sla

 

sla results

In the following example, 3 hosts are not set up for SLA collection. The host 127.0.0.1 is correctly set up, but the number of expected samples (total_waited) is greater than the number of samples actually collected. There was no loss in the collection; the difference can be explained by a change (decrease) in the polling interval over this time range.

Script used

-- Display SLA for DAY --
-- To run correctly, this file must be located in bin/config/script
-- Input values
-- lp_index index for this script ".1"
-- lp_oid SNMP OID for this script "ifnumber"
-- lp_host default ip address for this script "127.0.0.1"
-- Output Values
lp_value = 0;
lp_buffer ="error";

-- use this to initialise the host selection
dofile(lp.GetPath().."/config/script/bulk/selection/LP_Selection.lua")
dofile(lp.GetPath().."/config/script/lib-audit/1-audit.lua");

-----------------------------------------------------------------------------------------------
-- Start program
-----------------------------------------------------------------------------------------------

-- list the ip hosts to scan
tabz={};
hostnumber=LP_HostsSelection(tabz);
if hostnumber==0 then error("No host selected\n") end

-- load the SLA library (debug or release build)
if (lp.IsDebugMode()==1) then
  lib,init=lp.LoadLibrary(lp.GetPath().."/lua_lp_slad.dll","libinit");
else
  lib,init=lp.LoadLibrary(lp.GetPath().."/lua_lp_sla.dll","libinit");
end
if (lib==nil) then error("SLA Lib Not found or not loaded\n") end;
init();

lp.Print("Display SLA for day\n");

-- current date, broken down into fields
temp=os.date("*t",os.time());
--[[
temp.year
temp.month
temp.day
temp.hour
temp.min
--]]
lp.Print(string.format("\tyear %i month %i day %i\n",temp.year,temp.month,temp.day));

for i=0,table.getn(tabz) do
  info={};
  rep=lp.GetIPInformation(tabz[i],"array");
  if rep then
    if array.sla==1 then
      -- SLA collection is active for this host: compute the indicators for today
      lp.Print(string.format("------------------------------------------\nHost %s\nIP add : [%s]\n\n",array.name,tabz[i]));
      if lpsla.Compute(100001,tabz[i],temp.year,temp.month,temp.day,temp.year,temp.month,temp.day
        ,os.time{year=temp.year,month=temp.month,day=temp.day,hour=0}
        ,os.time{year=temp.year,month=temp.month,day=temp.day,hour=0}
        ,50,90,90,'array') then
        lp.Print("\t\tIP : ",array.ip,"\n");
        lp.Print("\t\tName : ",array.name,"\n");
        lp.Print("\t\tpolling_type : ",array.polling_type,"\n");
        lp.Print("\t\tCollect for periode : ",array.periode,"%\n");
        lp.Print("\t\t\tAvaibility : ",array.avaibility,"%\n");
        lp.Print("\t\t\tPerformance : ",array.performance,"%\n");
        lp.Print("\t\tGood_polling : ",array.good_polling,"\n");
        lp.Print("\t\tTotal_collected : ",array.total_collected,"\n");
        lp.Print("\t\tTotal_waited : ",array.total_waited,"\n");
      end
    else
      lp.Print("SLA not collected for host : ",array.ip," \n");
    end
  end
end

lp.Print("Scan Ended\n");
lp_buffer ="ok";