Tutorial on Introduction to biostatistics
Table of contents
Introduction to SAS Programming
Statistical Analysis System (SAS) software has become an integral part of the data analysis process, especially in the clinical research domain. A SAS program consists of two main kinds of steps, namely the DATA step and the PROC step. The DATA step is the starting point: it creates data sets for SAS programs from the raw data. The PROC step analyzes the data and generates the desired output of the analysis.
We will start with the first of these, the DATA step.
1. Data Step
The data step has the following two types of syntax.
The first type of data step reads the data directly from user input through the input and datalines statements.
data <dataset name>;
input <variable list>;
datalines;
Example 1a
data dataset1;
input age name $ weight;
datalines;
23 name1 76
;
run;
The second type of data step includes an infile statement. It reads the data from an external raw data file, such as a .dat file. Files in other formats, such as SPSS and Excel files, are not plain text and are usually brought in with proc import instead.
data <dataset name>;
infile '<path of the external data file>';
input <variable list of the external file>;
Example 1b
data dataset1;
infile 'd:\sasdata.dat';
input age name $ weight;
run;
1.1 libname statement
In the above examples, the data set that is created is stored only temporarily in the SAS working environment. It is deleted once we exit SAS.
To store the data set permanently, the libname statement is used. The syntax of the statement is as follows:
libname <name used to refer to the library, termed the libref> '<path of directory where the data set is to be stored>';
Example 1.1
libname sdata "d:\sasdataset";
In the above example, 'sdata' is the library reference (libref) that indicates the directory in which the data set is stored permanently. A data set in that library is then referred to by a two-level name such as sdata.dataset1. If we don't use the libname statement, SAS stores the data set in the temporary working library, whose libref is 'work'. We don't need to mention it explicitly, as SAS takes care of it.
An additional option, delimiter (for example ',' or ';'), can be used in the infile statement to indicate that the values are separated by that delimiter rather than by spaces, which is the default.
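As a minimal sketch of the delimiter option, the following data step reads comma-separated values. It uses in-stream data (infile datalines) so that it is self-contained, and the data values are made up for illustration; for an external file, the file path would take the place of the word datalines in the infile statement.

```sas
data dataset1;
    /* dlm is the short form of the delimiter option */
    infile datalines dlm=',';
    input age name $ weight;
    datalines;
23,name1,76
31,name2,68
;
run;
```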
1.2 Input Statement
The input statement is common to both of the above types of data step. It takes the following commonly used forms: list, column and formatted input.
a. Input statement with list format
The simplest form of the input statement is the list form, where the variables are simply listed and character (string) variables are indicated by a $ sign. The data values in the raw data are separated by spaces. In this form, missing values must be represented by '.', string values cannot contain spaces, and string variables have a default length of only 8 characters.
Example 1.2a
data sdata.dataset1;
input age name $ weight;
datalines;
28 sn1 56
;
run;
b. Input statement with column format
This type of input statement overcomes the drawbacks of the list form by indicating the starting and ending columns of each data value in the raw data. If a value occupies only one column, only that column is indicated.
Example 1.2b
input age 1-3 name $ 5-25 weight 27-30;
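A complete data step using column input might look like the sketch below. The names and values are invented for illustration, and the column ranges (1-3, 5-25, 27-30) are an assumption about how the raw data is laid out. Because positions matter, the data lines start in column 1 and are not indented. Note that the names contain spaces, which list input could not read.

```sas
data dataset1;
    /* age in columns 1-3, name in columns 5-25, weight in columns 27-30 */
    input age 1-3 name $ 5-25 weight 27-30;
    datalines;
28  Mary Ann Lee          56.5
35  Raj Kumar             61.0
;
run;
```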
c. Formatted input statement
The variables in the input statement are described by their input formats, called 'informats' in SAS. Each variable in the input statement is followed by its informat, which ends with '.'.
For example, a string variable is followed by a $ sign, the number of columns it occupies, and '.' (such as $10.).
A numeric variable is followed by its total width, say 5.; if the value has 2 decimal places, this is written as 5.2.
Similarly, date variables are read with a date informat such as mmddyy10. (month, day, year) or ddmmyy8. (day, month, year).
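The sketch below puts the three kinds of informat together. The data set name, variable names and values are hypothetical. The +1 pointer control skips the blank column between fields, and the format statement only controls how the date is displayed. When a numeric value already contains a decimal point, as here, SAS uses it as written.

```sas
data visits;
    /* name: columns 1-10, weight: columns 12-16, visit: columns 18-27 */
    input name $10. +1 weight 5.2 +1 visit mmddyy10.;
    format visit date9.;
    datalines;
name1      76.50 01/15/2020
name2      68.25 03/02/2021
;
run;
```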
1.3 Set Statement
If we want to use an existing SAS data set, the set statement is used in the data step.
Example 1.3
data sdata.dataset2;
set sdata.dataset1;
run;
In the above statement, the new data set 'dataset2' is created from the existing SAS data set 'dataset1'.
1.4 Modifying the SAS data set
The assignment statement helps us to create new variables and to modify the values of existing variables in the SAS data set.
1.4.1 Assignment statement
To create a new variable, the following syntax is used:
<new variable> = <mathematical expression of existing variable(s)>;
Example 1.4.1
profit = income - expenditure;
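In context, an assignment statement sits inside a data step after the statement that reads the data. A minimal sketch, with a hypothetical data set name and made-up values:

```sas
data accounts;
    input income expenditure;
    /* the new variable is computed for every observation */
    profit = income - expenditure;
    datalines;
5000 3200
4100 4500
;
run;
```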
1.4.2 If Then Else statement
If we want to modify the existing values of a variable, or assign new values based on the values of other variables, the if then else statement can be used. The syntax of the if then else statement is as follows:
if <condition> then <statement1>; else <statement2>;
The condition in the if statement can include comparison and logical operators such as
a) equal to (=)
b) not equal to (ne)
c) less than (<)
d) greater than (>)
e) and (&)
f) or (|)
g) not
Example 1.4.2
if hb > 14 and gender = 'M' then status = 'normal';
else status = 'weak';
The if statement must be used after an input or set statement; otherwise the variables it refers to have no values yet, and the SAS log will show a 'variable is uninitialized' message.
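The following sketch places an if then else statement after an input statement so that hb and gender have values when the condition is evaluated. The data set name and the values are made up; the length statement is an addition that fixes the width of status so that neither label is truncated.

```sas
data blood;
    input hb gender $;
    length status $ 6;
    if hb > 14 and gender = 'M' then status = 'normal';
    else status = 'weak';
    datalines;
15.2 M
11.8 M
14.5 F
;
run;
```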
1.4.3 do end statement
If we want to execute a group of statements together, they can be placed within a do end block. The following example uses a do end block to execute a group of statements when a condition is true:
if hb > 14 then
do;
age = 100;
gender = 'f';
end;
The above statements assign age = 100 and gender = 'f' for all observations where hb > 14.
1.4.4 Arrays
Arrays are used to group variables with a similar structure. Arrays can be used to carry out the same action on each of the variables using a do end loop. The array statement has the following syntax:
array <array name>{<number of variables>} <variable name><minimum number> - <variable name><maximum number>;
Example 1.4.4
array a{10} a1-a10;
do i = 1 to 10;
a{i} = 4;
end;
The above example will assign the value 4 to all 10 variables.
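A common use of arrays is to apply the same recoding to several variables. The sketch below assumes, purely for illustration, five hypothetical variables q1-q5 in which the code 99 stands for a missing answer; the loop replaces 99 with the SAS missing value in each of them.

```sas
data scores;
    input q1-q5;
    array q{5} q1-q5;
    do i = 1 to 5;
        if q{i} = 99 then q{i} = .;
    end;
    drop i;   /* the loop counter is not needed in the output data set */
    datalines;
3 4 99 2 5
99 1 2 3 4
;
run;
```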
1.4.5 Deleting Variables
If we want to delete variables from the data set, the drop or keep statements can be used.
The drop statement drops the listed variables from the data set. Its syntax is as follows:
drop <variable1 variable2 ...>;
The keep statement does the reverse: it keeps only the listed variables in the data set.
keep <variable1 variable2 ...>;
Both statements have to be used within the data step.
Example 1.4.5.1
data sdata.dataset2;
set sdata.dataset1;
keep age weight;
run;
Example 1.4.5.2
data sdata.dataset2;
set sdata.dataset1;
drop name;
run;
1.4.6 Deleting observations
We can delete observations that meet a condition using the delete statement. The delete statement is mostly used with the if statement.
Example 1.4.6
if age < 1 then delete;
Here the observations with age less than 1 will be deleted. The statement has to be used within a data step.
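Placed in a full data step, the delete statement might be used as in this sketch (data set name and values are hypothetical):

```sas
data children;
    input name $ age;
    if age < 1 then delete;   /* observations with age below 1 are not written out */
    datalines;
name1 0.5
name2 3
name3 7
;
run;
```

Note that SAS treats a missing numeric value as smaller than any number, so an observation with a missing age would also satisfy age < 1 and be deleted.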
1.4.7 Subsetting or splitting data sets
If we want to split the data set into groups, we can use the set statement along with an if condition. An if statement without then keeps only the observations for which the condition is true.
Example 1.4.7
data sdata1.diabetic;
set sdata1.dataset1;
if dstatus = 'y';
run;
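When both groups are needed, one possible approach, not covered in the text above, is to name two output data sets in a single data step and route each observation with the output statement. The sketch below uses temporary data sets and made-up values, and assumes dstatus takes the values 'y' and 'n'.

```sas
data patients;
    input patientno dstatus $;
    datalines;
1 y
2 n
3 y
;
run;

data diabetic nondiabetic;
    set patients;
    if dstatus = 'y' then output diabetic;
    else output nondiabetic;
run;
```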
1.4.8 Concatenating or combining data sets
If we want to combine two data sets one after the other, this can be done by listing both in the set statement.
Example 1.4.8
data sdata1.overall;
set sdata1.diabetic sdata1.nondiabetic;
run;
1.4.9 Merging data sets or adding variables to the existing data sets
If we want to merge two data sets, the two data sets should have one common variable.
The syntax of the merge statement is as follows:
data <new dataset>;
merge <dataset1> <dataset2>;
by <identifier variable>;
Example 1.4.9
data dataset5;
merge dataset3 dataset4;
by patientno;
run;
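A merge with a by statement expects both data sets to be sorted by the common variable, so in practice each one is usually passed through proc sort first. The following sketch builds two small data sets with made-up values, sorts them and then merges them:

```sas
data dataset3;
    input patientno age;
    datalines;
2 41
1 35
;
run;

data dataset4;
    input patientno weight;
    datalines;
1 62
2 70
;
run;

proc sort data=dataset3; by patientno; run;
proc sort data=dataset4; by patientno; run;

/* each patient now has age and weight in one observation */
data dataset5;
    merge dataset3 dataset4;
    by patientno;
run;
```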
2. PROC STEP
The proc step is the next component of the SAS programming environment. It helps us to analyze the data sets that have been brought into the SAS environment. It has the following syntax:
proc <name of the procedure> data=<name of the dataset>;
<procedure statement 1>
<procedure statement 2>
...
<procedure statement n>
run;
If we don't specify the data set, SAS uses the most recently created data set for the analysis.
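As a minimal sketch of a data step followed by a proc step, the example below creates a small data set with made-up values and summarizes two of its variables with proc means, one of the procedures listed in section 3:

```sas
data dataset1;
    input age name $ weight;
    datalines;
23 name1 76
28 name2 56
;
run;

proc means data=dataset1;
    var age weight;
run;
```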
SAS has a number of statistical procedures for analysis at the univariate and multivariate levels. Procedures help us to carry out descriptive, inferential and predictive analysis and to build different statistical models.
The following are some of the statements used within procedures:
2.1 var statement
2.2 where statement
2.3 by statement
2.4 class statement
2.1 var statement
The var statement specifies the variables to be used by the procedure. The syntax of the var statement is
var <variable name1 variable name2 ...>;
Example 2.1
var name age weight;
2.2 where statement
The where statement specifies which observations are to be used by the procedure, based on the logical conditions provided. The syntax of the where statement is
where <logical expression>;
Example 2.2
where age > 30 and weight < 60;
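Inside a procedure, the where statement restricts the observations that are processed without changing the data set itself. A sketch using proc print, with made-up values:

```sas
data dataset1;
    input age name $ weight;
    datalines;
23 name1 76
34 name2 58
41 name3 55
;
run;

/* only observations satisfying both conditions are listed */
proc print data=dataset1;
    where age > 30 and weight < 60;
run;
```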
2.3 by statement
The by statement is used to analyze data in groups. The variables in the by statement should be categorical. The syntax of the by statement is
by <variable name1 variable name2 ...>;
Example 2.3
var weight;
by gender;
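A procedure with a by statement expects the data to be sorted by the by variable, so the data set is normally sorted first. The sketch below uses made-up values and proc means as the analysis procedure; it produces one summary table per gender.

```sas
data dataset1;
    input gender $ weight;
    datalines;
M 76
F 56
M 81
F 62
;
run;

proc sort data=dataset1;
    by gender;
run;

proc means data=dataset1;
    var weight;
    by gender;
run;
```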
2.4 class statement
The class statement is similar to the by statement, except that it can also contain numeric variables with a small number of groups, and the output is shown in a single table, whereas with the by statement a separate table is shown for each group. The syntax of the class statement is
class <variable name1 variable name2 ...>;
Example 2.4
var age;
class weight;
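For comparison with the by statement, this sketch summarizes weight within each gender using class. The values are made up and gender is used as the grouping variable for illustration. No prior sorting is needed, and the groups appear together in one table.

```sas
data dataset1;
    input gender $ weight;
    datalines;
M 76
F 56
M 81
F 62
;
run;

proc means data=dataset1;
    class gender;
    var weight;
run;
```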
3. SAS Procedures
The following chapters discuss some of the commonly used SAS procedures related to data preparation, descriptive analysis, inferential analysis and model building, and report, graph and output preparation.
3.1 Data preparation level procedures
3.1.1 proc sort
3.1.2 proc format
3.1.3 proc transpose
3.2 Descriptive analysis
3.2.1 proc freq
3.2.2 proc means
3.2.3 proc univariate
3.2.4 proc summary
3.2.5 proc tabulate
3.2.6 proc corr
3.3 Inferential analysis
3.3.1 proc anova
3.3.2 proc reg
3.3.3 proc ttest
3.3.4 proc glm
3.3.5 proc genmod
3.3.6 proc mixed
3.3.7 proc npar1way
3.4 Report, graph and output preparation
3.4.1 proc print
3.4.2 proc contents
3.4.3 proc plot
3.4.4 proc gplot
3.4.5 proc boxplot