12. Combining SAS datasets 268 Appending datasets in different situa)ons PROC PRINT DATA=Lib9_3.emps; PROC PRINT DATA=Lib9_3.emps2008; PROC PRINT DATA=Lib9_3.emps2009; PROC PRINT DATA=Lib9_3.emps2010; 269 PROC APPEND PROC APPEND BASE=base-dataset DATA=data-dataset; - BASE = names the dataset to which observa)ons are added - DATA = names the dataset containing observa)ons that are added to the base dataset SAS does not create a new dataset: PROC APPEND adds the observa)ons of a dataset at the end of the base-dataset - Only two datasets can be used at a )me in one step - The observa)ons in the base-dataset are not read - The variable informa)on in the descriptor por)on of the base-dataset cannot change DATA work.emps; SET Lib9_3.emps; PROC APPEND BASE=work.emps DATA=Lib9_3.emps2008; PROC PRINT DATA=work.emps; 270 A second example If the base-dataset contains variable(s) not present in the data-dataset, SAS specifies a missing value for this variable(s) PROC APPEND BASE=work.emps DATA=Lib9_3.emps2009; PROC PRINT DATA=work.emps; 271 The FORCE op)on PROC APPEND BASE=work.emps DATA=Lib9_3.emps2010; PROC PRINT DATA=work.emps; This program will not do its job.. See the log PROC APPEND BASE=base-dataset DATA=data-dataset FORCE; We need it when the data-dataset contains variables that - Are not in the base-dataset - Are of different type (character/numeric) - Are longer than the variables in the base-dataset 272 What SAS does when the FORCE op)on is used PROC APPEND BASE=work.emps DATA=Lib9_3.emps2010 FORCE; PROC PRINT DATA=work.emps; 273 Using the DATA step for concatena)ng datasets DATA output-dataset ; SET dataset1 dataset2 ; addi+onal SAS statements ; - output-dataset names the dataset to be created - dataset1 dataset2 are the datasets to be read This program creates a new dataset in which observa)ons of successive datasets are added at the end of the previous datasest, as listed in the SET statement How the DATA step works with two datasets: - SAS reads variables from the first dataset and a PDV is created - SAS reads variables from the second dataset and eventually the PDV is completed - SAS reads each observa)on in the first dataset, writes it in the PDV and from there in the output dataset - SAS reads each observa)on in the second dataset, writes it in the PDV and from there in the output dataset The new dataset contains all of the variables and observa)ons from all of the input datasets 274 Concatena)ng datasets with different variables empsdk empsfr empscn empsjp PROC PRINT DATA=Lib9_3.empsdk; PROC PRINT DATA=Lib9_3.empsfr; PROC PRINT DATA=Lib9_3.empscn; PROC PRINT DATA=Lib9_3.empsjp; Same name DATA empsall1; SET Lib9_3.empsdk Lib9_3.empsfr; PROC PRINT DATA=empsall1; Different name DATA empsall2; SET Lib9_3.empscn Lib9_3.empsjp; PROC PRINT DATA=empsall2; 275 RENAME= op)on You can rename a variable using the RENAME op)on in a DATA step (RENAME= (old-name1 = new-name1 old-name2 = new-name2 old-namen = new-namen)) - old-name : the variable to be renamed - new-name : the new name for the variable - You can use rename in different statements of a data step (e.g. MERGE) DATA empsall2; SET Lib9_3.empscn Lib9_3.empsjp ( RENAME= (Region = Country) ); PROC PRINT DATA=empsall2; N.B.: RENAMES= op)on does not rename the variable in the input dataset, but it tells SAS which slot in the PDV to use when building the observa)ons 276 How concatena)on works if there are common variables If variables from different input datasets share the same name and: - Different type a{ribute, SAS stops proceedings the DATA step and gives an error message - Different length, label, format or informat a{ributes, SAS takes the a{ributes of the variable in the first dataset PROC CONTENTS DATA=Lib9_3.empscn; DATA empscn_gio; LENGTH Country $ 5; SET Lib9_3.empscn; PROC PRINT DATA=empscn_gio; DATA empsall3; SET empscn_gio Lib9_3.empsdk; PROC PRINT DATA=empsall3; 277 Appending vs Concatena)ng (1) 278 Appending vs Concatena)ng (2) 279 Interleaving datasets 280 DATA step for interleaving DATA dataset; SET SET-dataset1 SET-dataset2.. ; BY DESCENDING BY-variable(s); addi+onal statements ; - dataset : names the data to be created - SET-dataset : the datasets to be read - BY-variable(s): variables used to inteleave observa)on N.B.: Each input dataset must be sorted by the BY-variable DATA empsall4; SET Lib9_3.empscn Lib9_3.empsjp (RENAME=(Region=Country)); BY First; PROC PRINT DATA=empsall4; - SAS sorts duplicate BY-values in different datasets in the order in which their datasets are listed in the SET statement - SAS sorts duplicate BY-values in a dataset in the order in which they appear in the dataset 281 One-to-one merging DATA output-dataset; SET input-dataset1 ; SET input-dataset2 ; - output-dataset names the dataset to be created - input-dataset1 and input-dataset2 specify the dataset to be read - The new dataset contains all the variables from all the input datasets. - If the datasets contain variables that have the same name, the values which were read from the last dataset overwrite the values which were read from earlier datasets - The first observa)on in one dataset is joined with the first observa)on in the other, and so on. - The DATA step stops a er reading the last observa)on from the smallest dataset. Hence, the number of observa)on in the new dataset is the number of observa)on of the smallest dataset DATA emps_oto; SET Lib9_3.empsau ; SET Lib9_3.PhoneH ; PROC PRINT DATA=emps_OTO; 282 One-to-one merging DATA output-dataset; SET input-dataset1 ; SET input-dataset2 ; - output-dataset names the dataset to be created - input-dataset1 and input-dataset2 specify the dataset to be read - The new dataset contains all the variables from all the input datasets. - If the datasets contain variables that have the same name, the values which were read from the last dataset overwrite the values which were read from earlier datasets - The first observa)on in one dataset is joined with the first observa)on in the other, and so on. - The DATA step stops a er reading the last observa)on from the smallest dataset. Hence, the number of observa)on in the new dataset is the number of observa)on of the smallest dataset. DATA emps_oto; SET Lib9_3.emps ; SET Lib9_3.emps2010 ; PROC PRINT DATA=emps_OTO; 283 Match-merging Match-merging means to combine observa)ons from two or more datasets into a single observa)on in a new dataset according to the values of a common variable DATA output-dataset; MERGE input-dataset1 input-dataset2 ; BY DESCENDING BY-variable(s) ; REMEMBER TO SORT THE INPUT DATASET! If you don t do it, SAS gives you an error - output-dataset names the dataset to be created - input-dataset1 and input-dataset2 specify the dataset to be read. If you specify only a dataset, MERGE statement behaviors as SET statement - BY-variable, whose values are used to match observa)ons. They must have the same type in all datasets to be merged - DESCENDING op)on must be applied if by-variable is sorted in descending order in the input datasets - Basic match-merging produces an output dataset that contains values from all observabons in all input dataset - If an input dataset doesn t have any observa)ons for a par)cular value of the byvariable, then the observa)on in the output dataset contains missing values for the variables that are unique to that input dataset - Match merging can be done also using a PROC SQL step 284 One-to-one match merging: Ex. #1 DATA emps_oto; SET Lib9_3.empsau ; SET Lib9_3.PhoneH ; PROC PRINT DATA=emps_OTO; DATA emps_oto; MERGE Lib9_3.empsau Lib9_3.PhoneH ; BY EmpID ; PROC PRINT DATA=emps_OTO; 285 One-to-One match merging: Ex. #2 PROC SORT DATA= Lib9_3.emps OUT=sortemps; BY First; PROC SORT DATA= Lib9_3.emps2010 OUT=sortemps2010; BY First; DATA emps_oto; MERGE sortemps sortemps2010 ; BY First ; PROC PRINT DATA=emps_OTO; DATA emps_oto; SET sortemps; SET sortemps2010; PROC PRINT DATA=emps_OTO; 286 One to Many match-merging DATA emps_otm; MERGE Lib9_3.empsau Lib9_3.PhoneHW ; BY EmpID ; PROC PRINT DATA=emps_OTM; 287 How SAS processes the step: In the compila)on phase, SAS: compila)on phase - Creates the PDV reading the datasets in the order they are listed in the MERGE statement - Assignes a tracking pointer to each dataset 288 How SAS processes the step: execu)on phase (1) SAS reads the first observa)ons of each dataset: they are in the same by-group OUTPUT DATASET The values remain in the PDV (as the input datasets are SAS datasets) and SAS starts the second itera)on 289 How SAS processes the step: execu)on phase (2) SAS reads the second observa)ons of each dataset: since the obs in the second dataset is in the same by-group as the obs the PDV, SAS overwrites the values of Type and Phone in the PDV PDV OUTPUT DATASET The values remain in the PDV and SAS starts the third itera)on 290 How SAS processes the step: execu)on phase (3) PDV SAS reads the second observa)ons from the first dataset and the third observa)on from the second dataset Since they are not in the same by-group as the obs in the PDV, SAS reini)alizes the PDV to missing data SAS reads the current observa)ons from the two dataset in PDV: PDV The values remain in the PDV and SAS starts the fourth itera)on To the output dataset 291 How SAS processes the step: execu)on phase (4) SAS con)nues processing in this way un)ll the end of both the datasets 292 No matches case What does it happen when there are groups (levels) of the BY-variable which are not shared by the INPUT datasets? DATA emps_nm; MERGE Lib9_3.empsau Lib9_3.PhoneC ; BY EmpID ; PROC PRINT DATA=emps_NM; 293 No macth case: How SAS processes In the compilabon phase, SAS: the compila)on phase - Creates the PDV reading the datasets in the order they are listed in the MERGE statement - Assignes a tracking pointer to each dataset 294 No match case: How SAS processes the execu)on phase (1) SAS reads the first observa)ons of each dataset: they are in the same by-group OUTPUT DATASET The values remain in the PDV and SAS starts the second itera)on 295 How SAS processes the execu)on phase (2) SAS reads the second observa)ons of each dataset: since they are not in the same by-group as the obs in the PDV, SAS reini)alizes the PDV to missing data SAS chooses the observa)on in the first dataset ( ), and writes it in PDV.... And in the OUTPUT Dataset The values remain in the PDV and SAS starts the third itera)on 296 How SAS processes the execu)on phase (3) SAS reads the third obs of the first dataset and the second obs of the second dataset: since they are not in the same by-group as the obs in the PDV, SAS reini)alizes the PDV to MD SAS reads the current observa)ons from the two dataset in PDV and then in the OUTPUT DATASET The values remain in the PDV and SAS starts the fourth itera)on 297 How SAS processes the execu)on phase (4) SAS con)nues processing in this way un)ll the end of both the datasets - Matching observa)ons which contain data from both the datasets - Non-matching observa)ons which contain data from one of the datasets 298 Selec)ng only matches (or not-matches) (IN=variable) variable : the name of a temporary variable into the PDV which equals 1 if the observa)on matches and 0 otherwise Using IN= op)on in the MERGE statement with an IF statement allows selecbng only matches and non-matches DATA emps_onlym; MERGE Lib9_3.empsau (IN=emps) Lib9_3.PhoneC (IN=cell); BY EmpID ; IF (emps AND cell); PROC PRINT DATA=emps_onlyM; DATA emps_onlynm; MERGE Lib9_3.empsau (IN=emps) Lib9_3.PhoneC (IN=cell); BY EmpID ; IF (emps=0 OR cell=0); PROC PRINT DATA=emps_onlyNM; 299 Many to Many match-merging DATA emps_mtm; MERGE Lib9_3.empsAUUS Lib9_3.PhoneO; BY Country ; PROC PRINT DATA=emps_MTM; N.B.: - This procedure provides an dataset with six lines - If we want a 12-obs dataset (any combina)on of obs which share the same level) we must use the SQL PROC 300 Using data manipula)on techniques with a data merge 301 Who purchased what? PROC PRINT DATA=Lib9_3.Customer (OBS=20); PROC PRINT DATA=Lib9_3.Order_fact (OBS=20); PROC CONTENTS DATA=Lib9_3.Customer; PROC CONTENTS DATA=Lib9_3.Order_fact; PROC SORT DATA=Lib9_3.Order_fact OUT=Order_fact; BY Customer_ID; WHERE year(order_date)=2007; DATA CustOrd; MERGE Lib9_3.Customer Work.Order_fact ; BY Customer_ID ; PROC PRINT DATA=CustOrd (OBS=30); In the dataset CustOrd the same Customer ID is repeated for each order 302 Outpuzng two datasets We want to divide the merged dataset in two datasets: Orders (in which we retain only variables Customer_Name, Quan)ty and Total_Retail_Price) and NoOrders (in which we retain only variables Customer_Name and Birth_date). - In the first dataset we put the observa)ons which contain informa)on about the orders - In the second dataset we put observa)ons about customers which made no orders. DATA Orders (KEEP=Customer_Name Quan)ty Total_Retail_Price) NoOrders (KEEP=Customer_Name Birth_Date) ; MERGE Lib9_3.Customer Work.Order_fact (IN=order); BY Customer_ID ; IF order=1 THEN OUTPUT Orders; ELSE OUTPUT NoOrders; PROC PRINT DATA=Orders; PROC PRINT DATA=NoOrders; 303 Summing up the # of orders for each customer We want to create also a third dadaset, called Summary, in which there are two variables: Customer_Name and a new variable named NumberOrders, which counts the number of orders for each customer DATA Orders (KEEP=Customer_Name Quan)ty Total_Retail_Price) Summary (KEEP=Customer_Name NumberOrders) NoOrders (KEEP=Customer_Name Birth_Date) ; MERGE Lib9_3.Customer Work.Order_fact (IN=order); BY Customer_ID ; IF order=1 THEN DO; OUTPUT Orders; IF First.Customer_ID THEN NumberOrders=0; NumberOrders+1; IF Last.Customer_ID THEN OUTPUT Summary; END; ELSE OUTPUT NoOrders; PROC PRINT DATA=Summary; 304

Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.