Session 2 “Getting Started” Core Skills for Data Processing ORSC 2004 - Internal Training
Core Skill Training Session Six: “Data Analysis”
Objective
At the end of the training program, participants should be able to:
Understand data layouts
Understand what tables will look like
Define data structures for various data formats
Understand coding conventions
Get an appreciation of basic elements
Various data formats
Questionnaire data can be computerised in many ways
Market Research software mostly uses FLAT files
Customised software is available for capturing MR data
QINPUT, MERLIN, Surveycraft are some of the most popular ones
Single Card data
Serial Number/ Respondent ID
1000290022
00061860200310041324 040800100000000000 1.3979167
R1
1000390022
00061860200310041359 040800100000000000 0.6460563
R2
1001210022
00061860200310041249 040800100000000000 0.8865789
R3
1013240022
00061867200310051800 040800100000000000 0.6759740
R4
1013250022
00061867200310051831 040800100000000000 0.8857447
R5
1013260022
00061867200310051842 040800100000000000 1.3810526
1013300022
00061867200310051857 040800100000000000 1.5300000
1015240022
00062321200310041216 040800100000000000 1.4328262
Record length
Respondent ID is the unique ID for the record
Number of lines in the file = Sample Size
Maximum Length of record = 32,767 (Size of integer)
Multicard data
R1
00048011 01 04070917213204070917374232570237550
000480202837525750 111020744t242-345235849862468-2486
0004803 1 111-4 208050505050810 245248609824096
0004804001010 55334333333433453145555413155 646890
0004805 2115245444433353443442343435514334333 425924
R2
00070011 01 040709173010040709175624 245982496
000700201395277173 231019074646464060
0007003 1 112-7 105080803050308 426246
0007004030707 33543553245533535255452355555553
0007005 21113123322&2133222122431232323212313
Each respondent has more than one line of information; each line is called a “CARD”
In general the length of a card is 99 characters
A card can also be longer than 99 characters
Unique identification in this data format is Respondent ID + Card ID
Maximum Length of record = 32,767 (Size of integer). In this case the record length is the sum of the record lengths of all cards
Quantum data format
Quantum can handle both single card/ multicard data formats
In both formats, Quantum allows multipunch data
In multipunch data format, each column can hold 12 values: the individual constants 0123456789-&
Any combination of these 12 codes (punches) can exist in a single column
The advantage of this format is that more data fits into the available maximum record length of 32,767 characters
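To make the "any combination of 12 codes" idea concrete, here is a minimal Python sketch that models one multipunch column as a 12-bit mask. The function names and the bit layout are assumptions made for illustration only; they are not Quantum's internal storage format.

```python
# Hypothetical illustration: one multipunch column modelled as a 12-bit mask.
# The bit positions below are an assumption, not Quantum's file format.
PUNCHES = "0123456789-&"  # the 12 codes named on the slide


def encode_column(punches: str) -> int:
    """Pack a combination of punches (e.g. "31&") into a 12-bit integer."""
    mask = 0
    for p in punches:
        if p not in PUNCHES:
            raise ValueError(f"not a valid punch: {p!r}")
        mask |= 1 << PUNCHES.index(p)
    return mask


def decode_column(mask: int) -> str:
    """Return the punches present in a mask, listed in PUNCHES order."""
    return "".join(p for i, p in enumerate(PUNCHES) if mask & (1 << i))


if __name__ == "__main__":
    print(decode_column(encode_column("31&")))  # 13&
    print(2 ** len(PUNCHES))                    # 4096 possible combinations
```

With 12 independent punches, one column can take 4,096 distinct states, which is why multipunch data packs more answers into the same record length.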
Introducing Quantum – What does it do?
Check and validate the data
Edit and correct the data
Produce different types of lists and reports of data
Produce new data files
Recode data and produce new variables
Generate tables
Perform Statistical Calculations
Underlying concepts
Quantum consists of 2 phases or sessions
Edit Section
For each questionnaire:
- Check and correct data
- Modify/ Recode data
Tabulation Section
- Count questionnaires
- Produce tables
- Format tables
Underlying concepts
Edit section
• Data examination
• Data modification
• Data correction
Tables section
• Cross tabulation of data
• Control statements to determine layout
Layout of a table
(Figure: sample table with its parts labelled: Project Heading, Table title, Base Title, Base size, X-break, Side headings, Frequency, Percentage, Mean score)
Coding conventions
A Quantum program is a file created using a text editor
The tables section consists of statement types
Each statement starts on a new line
Each statement consists of parameters and options
A statement may be up to 200 characters
The standard Quantum separator is the semi-colon (;)
Long statements may be continued on new lines with a + in the first position. In certain cases long statements may be continued with a ++ in the first position
Comments are denoted by /* at the start of the line. You may see Quantum programs that use C at the start of a line for comments.
Coding conventions
A Sample of Quantum Program
/*
/* Here is a comment
/*
tab q5 brk1;c=c115'1';nz
+dsp
Fundamentals and Terminology
Fundamentals Individual constants
These are ASCII characters or multicodes, which are any combination of the codes 1234567890-&, or blank alone. They are enclosed in single quotes: '1' '2' '123' ' ' ….
Punch codes are referenced in apostrophes. Punches are listed individually; a range of punches is written with a slash (/), meaning ‘through’.
Examples:
'1' Punch 1
'123' Punches 1 or 2 or 3
'1/5' Punches 1 or 2 or 3 or 4 or 5
' ' no punches (blank)
Order of punches is & - 0 1 2 3 4 5 6 7 8 9 0 - &
'&/9' is the same as '1/&'
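The range rule can be sketched in Python as follows. This is an illustration under one stated assumption: when a code appears twice in the punch order (0, - and &), the shortest forward span is taken. The slides do not say how Quantum resolves that case, and the function names are hypothetical.

```python
ORDER = "&-01234567890-&"  # punch order quoted on the slide


def expand_range(start: str, end: str) -> list:
    """Punches from start through end, using the shortest forward span in ORDER.

    Assumption: the shortest span wins when a code occurs twice in ORDER.
    """
    starts = [i for i, ch in enumerate(ORDER) if ch == start]
    ends = [i for i, ch in enumerate(ORDER) if ch == end]
    spans = [(j - i, i, j) for i in starts for j in ends if i <= j]
    if not spans:
        raise ValueError(f"{start}/{end} is not a forward range in {ORDER}")
    _, i, j = min(spans)
    return list(ORDER[i:j + 1])


def expand_punches(spec: str) -> set:
    """Expand a constant such as '1/5', '123' or ' ' into a set of punches."""
    result = set()
    i = 0
    while i < len(spec):
        ch = spec[i]
        if ch == " ":  # blank contributes no punch
            i += 1
        elif ch not in ORDER:
            raise ValueError(f"not a valid punch: {ch!r}")
        elif i + 2 < len(spec) and spec[i + 1] == "/":
            result.update(expand_range(ch, spec[i + 2]))
            i += 3
        else:
            result.add(ch)
            i += 1
    return result


if __name__ == "__main__":
    print(sorted(expand_punches("1/5")))
    print(expand_punches("&/9") == expand_punches("1/&"))
```

Both '&/9' and '1/&' expand to all 12 punches, which is why the slide treats them as the same.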
Fundamentals Individual constants
The - punch is sometimes referred to as the 11th or X punch, and & is sometimes referred to as the 12th or Y or V punch.
Each code represents one answer to a question. For example, the question ‘What is your favorite color?’ has the following response list, coded into one column:
Red : 1
Yellow : 2
Blue : 3
Green : 4
Black : 5
White : 6
If my favorite color is green, this appears in the data file as a 4 in the appropriate column; if your favorite color is red, there is a 1 in that column.
Fundamentals Strings of Data Constants
Strings are lists of single ASCII characters. They are enclosed in dollar signs ($).
Strings refer to more than one column of data
Examples: $1234$ $ABC$ $ $
Fundamentals
Numbers
- Whole Numbers
- Real Numbers
Variables: Variables or arrays may be defined as data, integer or real types. Names may be up to 10 characters long.
Example:
int unit 1
real weight 10s
Whenever “s” is used, varn is interpreted as var(n)
Variables/ column referencing
Columns are referred to by their actual position in the data. If you open the data file in an editor and place the cursor on a character, the column number is the cursor's column position
In a single-card data file, the column position itself is used to refer to a column. For example, c12 refers to column 12
In a multicard data file, the column is referred to in combination with the card number. The format is “cXNN” if there are 9 or fewer cards and “cXXNN” if there are more than 9, where X is the card number and NN is the column position. One-digit column positions are written with a leading “0”. Example:
c108 refers to 1st card 8th column
c412 refers to 4th card 12th position
c1009 refers to 10th card 9th position
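The multicard reference format can be decoded mechanically. The Python helper below is a sketch, not part of Quantum, and its name is hypothetical. It treats the last two digits as the column and the leading digits as the card. The handling of a trailing "00" follows the card layout given later in these slides (card 1 = columns 101-200).

```python
def split_column_ref(ref: str) -> tuple:
    """Split a multicard reference such as 'c108' into (card, column).

    Hypothetical helper: the last two digits are the column, the digits
    before them are the card number.
    """
    if len(ref) < 4 or not ref.startswith("c") or not ref[1:].isdigit():
        raise ValueError(f"not a multicard column reference: {ref!r}")
    digits = ref[1:]
    card, column = int(digits[:-2]), int(digits[-2:])
    if column == 0:
        # Card k spans columns k01..(k+1)00, so 'c200' is card 1, column 100.
        card, column = card - 1, 100
    if card < 1:
        raise ValueError(f"card number must be at least 1: {ref!r}")
    return card, column


if __name__ == "__main__":
    for ref in ("c108", "c412", "c1009"):
        print(ref, split_column_ref(ref))
```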
Variables/ column referencing
A series of columns may be considered as either string or numeric and is referenced as c(m,n) where m is the start column position and n is the end column position
Examples:
c(12,15) refers to columns 12 to 15 in a single card data file
c(106,110) refers to columns 6 to 10 of 1st card in a multicard data file
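For readers who want to inspect a data file outside Quantum, a field reference c(m,n) corresponds to a 1-based, inclusive slice of the record. The Python sketch below assumes a single-card record held as a string. The record contents after the respondent ID are made up for illustration, and the function name is hypothetical.

```python
def field(record: str, m: int, n: int) -> str:
    """Return columns m..n of a record (1-based, inclusive), like c(m,n).

    Records shorter than n are padded with blanks.
    """
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    return record[m - 1:n].ljust(n - m + 1)


if __name__ == "__main__":
    # Columns 1-10: respondent ID from the slide; the rest is hypothetical.
    record = "1000290022" + "X" + "0427" + "END"
    print(field(record, 1, 10))       # the respondent ID, as a string
    print(int(field(record, 12, 15)))  # the same columns read as a number
```

The same slice can be read as a string or converted to a number, mirroring the slide's point that a series of columns may be treated as either.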
Describing Data Structure
Data Structure
By default Quantum reads one record or a line from your data file at a time. Each record may be up to 100 columns long
Most Market Research surveys consist of multi-card records
Some surveys consist instead of long records with more than 100 columns of data
These data structures must be described in the struct statement
Format: struct;options
The “struct” statement must be the first statement in your program
Data Structure – contd..
Specifying Long records
struct;reclen=n
where n is the length of the record in columns
the maximum length of a record is approximately 32,000 columns
Specifying Multi-card Data Sets
This is the most common form of struct statement
struct;read=2;ser=c(m,n);crd=c(p,q)
Where, read = 2 denotes a multi-card set; ser = defines the columns of the serial number; crd = defines columns of the card number
Example: struct;read=2;ser=c(1,4);crd=c80
Data Structure – contd..
When a multi-card set is read, the cards are defined as follows:
Card 1 Columns 101-200
Card 2 Columns 201-300
Card 3 Columns 301-400
Card 4 Columns 401-500
…..
Card 10 Columns 1001-1100
By default a maximum of 9 cards are permitted in a set.
Reading Multi-card data sets with 10 or more cards
The option max=n is used to define the maximum number of cards in the set
Example: struct;read=2;ser=c(1,5);crd=c(6,7);max=19
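The card-to-column mapping above can be pictured as one long array in which card k fills columns k*100+1 to k*100+100. The Python sketch below builds such an array from a dictionary of card lines. It only illustrates the addressing scheme and is not how Quantum stores data. The card contents and the function name are hypothetical.

```python
CARD_WIDTH = 100  # each card occupies a block of 100 columns


def flat_record(cards: dict, max_cards: int = 9) -> list:
    """Lay cards out so that card k fills columns k*100+1 .. k*100+100.

    Index 0 is unused so that list indices match column numbers.
    """
    columns = [" "] * ((max_cards + 1) * CARD_WIDTH + 1)
    for card_no, text in cards.items():
        if not 1 <= card_no <= max_cards:
            raise ValueError(f"card {card_no} exceeds max={max_cards}")
        for offset, ch in enumerate(text[:CARD_WIDTH], start=1):
            columns[card_no * CARD_WIDTH + offset] = ch
    return columns


if __name__ == "__main__":
    # Hypothetical cards: serial in columns 1-5, card number in columns 6-7.
    cards = {1: "0004801ABC", 10: "0004810XYZ"}
    rec = flat_record(cards, max_cards=19)
    print(rec[108])   # card 1, column 8
    print(rec[1009])  # card 10, column 9
```

With the default of 9 cards, passing a card 10 raises an error, which mirrors the need for max=n when a set has 10 or more cards.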
Data Structure – contd.. Checking the structure of multi-card data sets
Quantum automatically checks for duplicate card types within a serial number and for adjacent duplicate serial numbers
It is not mandatory that all cards should be present for every respondent in a multicard data file
It is possible to check that specific cards are present using req=
Example: struct;read=2;ser=c(1,5);crd=c(6,7);max=19;req=1,2
In this example each record must have a card 1 and 2 present. If either or both are missing the record will be rejected
If you require a series of cards to be present specify the first and last separated by a slash
struct;read=2;ser=c(1,5);crd=c(6,7);max=19;req=1/5
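The checks described above (duplicate cards within a serial number, required cards present) can be approximated outside Quantum when pre-screening a data file. The Python sketch below assumes the layout from the example, ser=c(1,5) and crd=c(6,7). The function names, the sample lines and the message wording are hypothetical, and mixing commas and slashes in one req value is an assumption of this sketch.

```python
def parse_req(spec: str) -> set:
    """Turn a req= value such as '1,2' or '1/5' into a set of card numbers."""
    required = set()
    for part in spec.split(","):
        if "/" in part:
            first, last = part.split("/")
            required.update(range(int(first), int(last) + 1))
        else:
            required.add(int(part))
    return required


def check_cards(lines, req: str, ser=(1, 5), crd=(6, 7)) -> dict:
    """Group cards by serial number and list problems per respondent."""
    seen = {}      # serial -> set of card numbers read so far
    problems = {}  # serial -> list of messages
    for line in lines:
        serial = line[ser[0] - 1:ser[1]]
        card = int(line[crd[0] - 1:crd[1]])
        cards = seen.setdefault(serial, set())
        if card in cards:
            problems.setdefault(serial, []).append(f"duplicate card {card}")
        cards.add(card)
    required = parse_req(req)
    for serial, cards in seen.items():
        missing = sorted(required - cards)
        if missing:
            problems.setdefault(serial, []).append(f"missing cards {missing}")
    return problems


if __name__ == "__main__":
    # Hypothetical data: respondent 00070 has card 1 twice and no card 2.
    lines = ["0004801aaa", "0004802bbb", "0007001ccc", "0007001ddd"]
    print(check_cards(lines, req="1,2"))
```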