vinaitheerthan


Tutorial on Introduction to biostatistics

Table of contents

Introduction to SAS Programming

Statistical Analysis System (SAS) software has become an integral part of the data analysis process, especially in the clinical research domain.

The SAS programming environment consists of two main steps, namely the DATA step and the PROC step. The DATA step is the starting point, where data sets are created for SAS programs from the raw data. The PROC step helps us to analyze the data and generate the desired output of the analysis.

We will start with the first step, namely the DATA step.

1. Data Step

The data step has the following two types of syntax.

The first type of data step reads the data directly from user input through the input and datalines statements

data <dataset name>;
input <variable list>;
datalines;

Example 1a

data dataset1;
input age name $ weight;
datalines;
23 name1 76
;
run;

 
The second type of data step includes an infile statement. It reads the data from external raw data files such as .dat, .txt and .csv files. Files in other formats, such as SPSS and Excel files, are usually brought in with proc import instead.

data <dataset name>;
infile <path of the external data file>;
input <variable list of the external file>;

Example 1b

data dataset1;
infile 'd:\sasdata.dat';
input age name $ weight;
run;

 
1.1 libname statement

In the examples above, the data set that is created is stored temporarily in the SAS working environment. It is deleted once we exit SAS. To store the data set permanently, the libname statement is used. The following is the syntax of the statement

libname <library reference (libref)> <path of the directory where the data set is to be stored>;

Example 1.1

libname sdata "d:\sasdataset";

In the above example, sdata is the library reference (libref) that points to the directory in which the data set is stored permanently. A data set is placed there by giving it a two-level name such as sdata.dataset1. If we give only a one-level name, SAS stores the data set in the temporary library, whose libref is work. We don't need to mention work explicitly, as SAS takes care of it.

An additional option, delimiter= (for example delimiter=',' or delimiter=';'), can be used in the infile statement to indicate that the values are separated by that delimiter and not by spaces, which is the default.
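
As a minimal sketch of the delimiter option, the step below reads comma-separated values. It assumes in-stream data (infile datalines) so that it can run without an external file. The data set name patients and the values are made up for illustration.

```sas
/* Read comma-separated in-stream data.                     */
/* 'patients' and the data values are illustrative only.    */
data patients;
  infile datalines dlm=',';   /* dlm= is the short form of delimiter= */
  input age name $ weight;
  datalines;
23,name1,76
31,name2,64
;
run;

/* List the observations to check that the values were split correctly */
proc print data=patients;
run;
```

For an external file, the word datalines in the infile statement would be replaced by the quoted path of the file.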

1.2 Input Statement

The input statement is common to both of the above types of data step. It takes the following commonly used forms: list, column and formatted input

a. Input statement with list format

The simplest form of the input statement is the list form, where the variables are simply listed and character (string) variables are indicated by a $ sign. The data values in the raw data are separated by spaces. In this form, missing values have to be represented by '.', string values cannot contain spaces, and string variables are limited to a length of 8 by default.

Example 1.2a

data sdata.dataset1;
input age name $ weight;
datalines;
28 sn1 56
;
run;

 

b. Input statement with column format

This type of input statement overcomes the drawbacks of the list form by indicating the starting and ending columns of each data value in the raw data. If a value occupies only one column, only that column is indicated.

Example 1.2b

input age 1-3 name $ 5-25 weight 27-30;

c. Formatted input statement

Each variable in the input statement is followed by its input format, called an "informat" in SAS. The name of an informat always ends with a '.'.

For example, a string variable is followed by a $ sign, then the number of storage spaces it takes, then a '.', such as $10.

A numeric variable is followed by the total width of the value and a '.'; a width of 5 is written as 5. and a width of 5 with 2 decimal places is written as 5.2

Similarly, date variables are read with a date informat such as mmddyy10. or ddmmyy8.
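
The following is a small sketch of formatted input, assuming in-stream data laid out in fixed columns. The data set name visits, the variable names and the values are hypothetical. The +1 pointer control and the format statement are not discussed above; they are used here only to skip a blank column and to display the date in readable form.

```sas
/* Formatted input: each variable is followed by its informat.   */
/* name   : columns 1-10,  character, informat $10.              */
/* weight : columns 11-15, numeric,   informat 5.2               */
/* +1     : skips the blank in column 16                         */
/* visit  : columns 17-26, date,      informat mmddyy10.         */
data visits;
  input name $10. weight 5.2 +1 visit mmddyy10.;
  format visit date9.;   /* show the stored date as, e.g., 15JAN2020 */
  datalines;
name1     76.50 01/15/2020
name2     64.25 11/03/2019
;
run;

proc print data=visits;
run;
```

Because each informat reads a fixed number of columns, the values in the data lines have to be aligned exactly as described in the comments.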

1.3 Set Statement

If we want to use an existing SAS data set, the set statement is used in the data step
 

Example 1.3

data sdata.dataset2;
set sdata.dataset1;
run;

 
In the above statement the new data set ‘dataset2’ is created from the existing SAS data set ‘dataset1’.

1.4 Modifying the SAS data set

The assignment statement helps us to create new variables and to modify the values of existing variables in the SAS data set.

1.4.1 Assignment statement

To create a new variable, the following syntax is used

<new variable> = <expression of existing variable(s)>;

Example 1.4.1

profit = income - expenditure;
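
An assignment statement only works inside a data step. The sketch below places the example in a complete step; the data set name accounts and the values are assumed for illustration.

```sas
/* The assignment is evaluated once for every observation read */
data accounts;
  input income expenditure;
  profit = income - expenditure;   /* new variable created from existing ones */
  datalines;
500 320
410 450
;
run;

proc print data=accounts;
run;
```

With these values, profit is 180 for the first observation and -40 for the second.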

1.4.2 If Then Else statement

If we want to modify the existing values of a variable, or assign new values based on the values of some other variables, the if-then-else statement can be used. The following is the syntax of the statement

if <condition> then <statement1>; else <statement2>;

The condition in the if statement can include comparison and logical operators such as

a) equal to (=)
b) not equal to (ne)
c) less than (<)
d) greater than (>)
e) and (&)
f) or (|)
g) not

Example 1.4.2

if hb > 14 and gender = 'M' then status = 'normal'; else status = 'weak';

The if statement has to come after the input or set statement that supplies the variables it uses; otherwise the SAS log will report that the variable is uninitialized.
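
A minimal sketch of the if-then-else statement inside a complete data step follows. The data set name bloodtest and the values are assumptions. The length statement is an addition not covered above; it is included so that the character variable status is long enough for either value.

```sas
data bloodtest;
  length status $ 6;      /* long enough for 'normal' and 'weak' */
  input hb gender $;      /* hb and gender are read before they are tested */
  if hb > 14 and gender = 'M' then status = 'normal';
  else status = 'weak';
  datalines;
15.2 M
12.1 M
14.8 F
;
run;

proc print data=bloodtest;
run;
```

Only the first observation satisfies both conditions, so it gets the status 'normal' and the other two get 'weak'.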

1.4.3 do end statement

If we want to execute a group of statements together, they can be placed within a do-end block. The following example uses a do-end block to execute a group of statements when a condition is true

if hb > 14 then do;
age = 100;
gender = 'f';
end;

The above statements will assign age = 100 and gender = 'f' for all the observations where hb > 14.

1.4.4 Arrays

Arrays are used to group variables with a similar structure. Arrays can be used to carry out the same action on each of the variables with a do loop. The array statement has the following syntax

array <array name>{<number of variables>} <variable name><minimum number> - <variable name><maximum number>;

 

Example 1.4.4

array a{10} a1-a10;
do i = 1 to 10;
a{i} = 4;
end;

 
The above example will assign the value 4 to all 10 variables.
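
As a further sketch, an array combined with an if statement can apply the same rule to several variables. The example assumes five hypothetical score variables q1 to q5 in which the value 99 was used as a code for a missing answer. The data set name scores and the values are made up.

```sas
data scores;
  input q1-q5;
  array q{5} q1-q5;                  /* q{1} refers to q1, ..., q{5} to q5 */
  do i = 1 to 5;
    if q{i} = 99 then q{i} = .;      /* replace the assumed code 99 by a missing value */
  end;
  drop i;                            /* the loop counter is not needed in the output */
  datalines;
3 99 4 5 99
1 2 99 4 5
;
run;

proc print data=scores;
run;
```

Without the array, the same if statement would have to be written once for each of the five variables.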

1.4.5 Deleting Variables

If we want to delete variables from the data set, the drop or keep statements can be used.
The drop statement removes the listed variables from the data set. Its syntax is as follows

drop <variable1 variable2 ..>;

The keep statement does the reverse: it keeps only the listed variables in the data set

keep <variable1 variable2 ..>;

Both statements have to be used in the data step.

Example 1.4.5.1

data sdata.dataset2; set sdata.dataset1; keep age weight; run;

 

Example 1.4.5.2

data sdata.dataset2; set sdata.dataset1; drop name; run;

 

1.4.6 Deleting observations

We can delete observations that meet a condition using the delete statement. The delete statement is mostly used with an if statement.
 

Example 1.4.6

if age < 1 then delete;


Here the observations with age less than 1 will be deleted. The statement has to be used within a data step.

1.4.7 Sub setting or splitting data sets

 
If we want to split the data set into groups, we can use the set statement along with an if condition. An if statement without then keeps only the observations that satisfy the condition.

Example 1.4.7

data sdata1.diabetic;
set sdata1.dataset1;
if dstatus = 'y';
run;

1.4.8 Concatenating or combining data sets

If we want to combine two data sets, we can list both of them in a single set statement. The observations of the second data set are added after those of the first.

Example 1.4.8

data sdata1.overall;
set sdata1.diabetic sdata1.nondiabetic;
run;

1.4.9 Merging data sets or adding variables to the existing data sets

If we want to merge two data sets, they should have at least one common variable, and both data sets should be sorted by that variable.

The syntax of the merge statement is as follows

data <new dataset>;
merge <dataset1> <dataset2>;
by <identifier variable>;

 

Example 1.4.9

data dataset5;
merge dataset3 dataset4;
by patientno;
run;
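
The sketch below shows a complete merge, including the sorting that the by statement relies on. The data set names demog, labs and combined and all values are hypothetical. proc sort is used here ahead of its own discussion, only to put both data sets in order of the common variable.

```sas
/* First data set: one record per patient, deliberately unsorted */
data demog;
  input patientno age;
  datalines;
2 41
1 35
;
run;

/* Second data set: a different variable for the same patients */
data labs;
  input patientno hb;
  datalines;
1 13.5
2 15.1
;
run;

/* Both data sets must be in order of the by variable before merging */
proc sort data=demog; by patientno; run;
proc sort data=labs;  by patientno; run;

/* Match observations on patientno and bring the variables together */
data combined;
  merge demog labs;
  by patientno;
run;

proc print data=combined;
run;
```

Each observation in combined holds patientno, age and hb for one patient.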

 

2. PROC STEP

The PROC step is the next component of the SAS programming environment. It helps us to analyze the data sets which are brought into the SAS environment. It has the following syntax

proc <name of the procedure> data=<name of the dataset>;
<procedure statement 1>;
<procedure statement 2>;
...
<procedure statement n>;
run;

 

If we don't specify the data set, SAS will use the most recently created data set for the analysis.

SAS has a number of statistical procedures which are used to carry out analysis at the univariate and multivariate levels. Procedures help us to carry out descriptive and inferential analysis, prediction, and the building of different statistical models.

The following are some of the statements used within procedures

2.1  var statement

2.2  where statement

2.3  by statement

2.4  class statement

2.1 var statement

The var statement specifies the variables to be used by the procedure. The syntax of the var statement is

var <variable name1 variable name2 ...>;

Example 2.1

var name age weight;

2.2 where statement

The where statement specifies which observations are to be used by the procedure, based on the logical conditions provided. The syntax of the where statement is

where <logical expression>;

 

Example 2.2

where age >30 and weight < 60;

 

2.3 by statement

The by statement is used to analyze data in groups. The variables in the by statement should be categorical, and the data set has to be sorted by them beforehand. The syntax of the by statement is

by <variable name1 variable name2>;

Example 2.3

var weight;
by gender;
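
Example 2.3 can be placed in a complete program as sketched below. The data set name patients and the values are made up, and proc means is chosen only as a convenient procedure that accepts var and by statements.

```sas
data patients;
  input gender $ weight;
  datalines;
M 76
F 58
M 81
F 62
;
run;

/* by-group processing expects the data in order of the by variable */
proc sort data=patients;
  by gender;
run;

/* One table of summary statistics for weight per value of gender */
proc means data=patients;
  var weight;
  by gender;
run;
```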

2.4 class statement

The class statement is similar to the by statement, except that it can also contain numeric variables with a small number of groups and the data set need not be sorted. The output is shown in a single table, whereas with the by statement it is shown in a separate table for each group. The syntax of the class statement is

class <variable name1 variable name2>;

Example 2.4

var age;
class weight;
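
The sketch below combines the where, var and class statements in one procedure call. It assumes a hypothetical data set patients with the variables gender, age and weight, and again uses proc means for illustration.

```sas
data patients;
  input gender $ age weight;
  datalines;
M 45 76
F 29 58
M 52 81
F 38 62
F 61 70
;
run;

/* No sorting is needed for the class statement */
proc means data=patients;
  where age > 30;     /* use only observations with age above 30 */
  var weight;         /* variable to summarize */
  class gender;       /* groups shown together in a single table */
run;
```

Replacing class with by would require sorting the data set by gender first and would give one table per group.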


3. SAS Procedures

The following chapters discuss some of the commonly used SAS procedures for data preparation, descriptive analysis, inferential analysis and model building, and report, graph and output preparation

 3.1  Data Preparation level procedures

3.1.1  proc sort

3.1.2  proc format

3.1.3  proc transpose 

3.2  Descriptive analysis

3.2.1  proc freq

3.2.2  proc means

3.2.3  proc univariate

3.2.4  proc summary

3.2.5  proc tabulate

3.2.6  proc corr

3.3  Inferential analysis

3.3.1  proc anova

3.3.2  proc reg

3.3.3  proc ttest

3.3.4  proc glm

3.3.5  proc genmod

3.3.6  proc mixed

3.3.7  proc npar1way

3.4  Report, graph and output preparation

3.4.1  proc print

3.4.2  proc contents

3.4.3  proc plot

3.4.4 proc gplot

3.4.5  proc boxplot

 

 

 

 

 

 

 

 

 

 
