Consolidated Logging Repository (CLR) Metrics Aging
DESIGN Repository Team 10 April 2007
© 2008, Sabre Inc. All rights reserved. This documentation is the confidential and proprietary intellectual property of Sabre Inc. Any unauthorized use, reproduction, preparation of derivative works, performance, or display of this document, or software represented by this document, without the express written permission of Sabre Inc. is strictly prohibited. SabrePat Furrow and the Sabre logo design are trademarks and/or service marks of an affiliate of Sabre Inc. All other trademarks, service marks, and trade names are owned by their respective companies.
Table of Contents
Metrics Aging Design...............................................................................................1 History....................................................................................................1
Over view Problem Statement................................................................................1 Objectives..............................................................................................1 Assumptions..........................................................................................1 CLR Environments and Retention Requirements..................................1
Database Design MySQL Accounts and Connection Parameters......................................2 Level 0 – Detail Logging (Staging Area)................................................3 Level 1 – Weekly Summary...................................................................5 Level 2 – Monthly Summary..................................................................6 Level 3 – Annual Summary....................................................................7 Logging Subscriber................................................................................8
Requirements Monitoring..............................................................................................9 Failure Recovery....................................................................................9 Aging Process......................................................................................10 Backlog for Future Releases...........................................................10 Test Plan..............................................................................................10 Security................................................................................................11 Issues from Reviews............................................................................12
Appendix Issues from Reviews...............................................................................
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
iii
21
Overview
History The Consolidated Logging Repository (CLR) was created to provide a central store of operational metrics used to manage the many computing systems at Sabre. Operational metrics are captured from the Sabre services instrumented with the Integrated Computing Environment (ICE) logging API. There are five types of logging that a service can generate: metrics, application, security, billing and audit. Metrics logging is designed to measure the behavior of a service and was the first logging type to be organized into a relational database in the CLR. Problem Statement The initial MySQL schema used to store metrics logging closely matched the actual format of the logged data. Growth in volume of the metrics logging has now exceeded the capacity of the schema (report and ad-hoc query performance is significantly degraded) and exhausted the available storage capacity. Objectives The goal of the metrics aging solution is to provide summarized levels of metrics to: •
meet operational management requirements
•
optimize the performance of reporting and ad-hoc analysis, and
•
minimize the repository costs.
Assumptions The metrics aging solution will be implemented in 2Q07. The metrics aging solution will use a current release of MySQL 5.0.
Database Design
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
CLR Environments and Retention Requirements
1
Environment
Servers
SAN Storage
Aging Level
Retention Period
Production
ICESLP07/08
1500 GB
Detail Minute Hour Day
14 days 13 weeks 15 months 20 years
Certification
ICESLC05/06
500 GB
Detail Minute Hour Day
14 days 6 weeks 4 months 1 year
Integration
ICESLI10/11
500 GB
Detail Minute
14 days 6 weeks
Development
ICESLP09
500 GB
Detail
14 days
MySQL Accounts and Connection Parameters Environment
Servers
P ort
Databases
Account – Use
Production
ICESLP07/08
3306
fabric_prod_his t
iceread – reporting, ad-hoc analysisread only
Certification
ICESLC05/06
3306
fabric_cert_hist
iceadmin – subscriber, summary & aging scripts
Development
ICESLP09
3306
fabric_dev_hist
iceread – read only ice – administration
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
2
Level 0 – Detail Logging (Staging Area)
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
3
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
4
Level 1 – Weekly Summary
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
5
Level 2 – Monthly Summary
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
6
Level 3 – Annual Summary
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
7
Requirements
3
Logging Subscriber The ICE logging database subscriber will need to be modified to insert the metrics logs it receives into the new Level 0 table. Major changes in this schema include: •
CLIENT_IND is part of the unique key to prevent duplicate inserts.
•
All calculated statistics have been removed (AVG, STD_DEV). These will be calculated in the BusinessObjects universe.
•
The bucket table has been separated into two distribution tables.
•
The only foreign key relationships are on CLR_IND and METRIC_ID which are enforced by MySQL.
•
The size of the SUM_OF_SQ columns has been increased to BIGINT UNSIGNED.
•
All metrics (counts, times and sizes) are unsigned and will not accept negative values.
The following changes in the subscriber logic will be required to accommodate the new aging schema and processes: 1. The staging level of the metrics schema Each day will have its own unique set of tables with the date specified in each table’s name as followswill have the following five tables: a. METRIC_L0_yyyymmdd b. METRIC_L0_COUNT_yyyymmdd c. METRIC_L0_ITEM_yyyymmdd d. METRIC_L0_ITEM_DISTRIB_yyyymmdd e. METRIC_L0_DISTRB_yyyymmdd – This table’s structure is identical to METRIC_L0_ITEM_DISTRIB but will only contain the optional distribution detail for the metrics stored in the main metricMETRIC_L0 table. The metric to which the distribution relates will be indicated by specifying one of the following values for ITEM_NAME: 1)EXIST_TIME 2)IN_MSG_SIZE 3)OUT_MSG_SIZE Metrics logs should be inserted into the day table matching the
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
8
GENERATED_TIME of the log. 2. If any of the proper L0 tables are not available for inserts (table with that date does not exist, table is locked, MySQL is down, etc.), the whole metrics log should be written to the secondary (recovery) log file and no part of the log should be inserted in any L0 table. An application error should be generated each time an insert fails and this in turn should trigger an alert to operations staff every each 5 minutes that the errors continue to occur to alert operations staff that metrics inserts are failing. Note
We would expect to get errors/alerts on any erroneous future date logging that isn’t corrected prior to this implementation.This section contains examples and instructions for the various styles to use with graphics.
Monitoring 1. Operations staff should be alerted whenever the subscriber processes are not running. 2. The subscriber generated errors process should be monitored and in turn generate an application erroralert every 5 minutes to alert notify Fabric operations Coverage staff (or MRC for test environments) when inserts of logging to the MySQL L0 tables are failing. Note
Erroneous future date logging will also generate these alerts.
3. Logging patterns and aging summarization should be validated daily and generate email alerts to the repository team if anomalies are detected. This validation can likely be implemented using BusinessObjects published reports. Types of anomalies to be detected include: a. New logs/missing logs for a node or method b. Significant changes in volume of logging c. Missing logging intervals d. Mismatch between the two database images Failure Recovery The CLR maintains two parallel database images at all times in order to prevent loss of logging data when planned or unplanned failures occur. Failures can occur at several levels: •
Subscriber process
•
Aging process
•
MySQL database
•
CLR server
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
9
Within 48 hours of recovering from such a failure, each database image should be restored to a full replica of the logging data by using one of the following processes: 1. Logging recovery process – The subscriber will read the recovery log it generated during the failure and inserted the saved logs into the L0 tables on the same server where the failure occurred. 2. Table copy process – Whole tables can be restored by dropping the incomplete table on the database which suffered the failure and copying the complete table from the opposite server’s database. 3. Reconciliation process – If neither database image has a complete table of logging for a given dayweek, month or year, a process to read a specified time range of data from one server’s database and insert it into the other will be used to generate a complete logging table on one server. Then the table copy process can be used to restore the other database’s table. 4. Backup/Restore process – The database on each server (including InnoDB and MyISAM tables) must be backed up daily. If multiple tables in a database are corrupted, a previous day backup from either server or a current backup from the other server can be restored. Aging Process 1. Summarize any Level 0 tables where the latest RECEIVED_TIME is greater than the last summarization process run time. 2. Purge any Level 0 data that is older than the specified retention period (initially 14 days). 3. Compress any Level 1 – 3 table once the last possible Level 0 table it could containdetail for any date contained in that summary level is dropped. Apply best practices to guarantee the integrity of tables during compression. 4. Schedule aging/table compression to minimize the impact to reporting and adhoc query activity. Backlog for Future Releases
5. Create a summary table of intervals dropped from Level 0 due to zero counts. 6. Create a table of peak intervals from Level 0 detail. Test Plan 1. Test each failure recovery process. 2. Test existing Business Objects reports with new schema. 3. Test subscriber when: a. L0 table does not exist (both future date and past date logging)
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
10
b. Database backup is running c. Past date data is received and L0 table does exist d. Duplicate logging is received Security
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
11
Appendix
4
Issues from Reviews The following issues were raised during various reviews of the database and schema design. They are documented here for future reference. 1. Summarization processes need to be designed and documented as well as the schema. See the Aging Process Requirements. 2. What happens if the subscriber can’t find the right Level 0 table to insert the metrics log? See the Logging Subscriber Requirements. 3. Consider naming Level 0 tables with fixed names instead of dynamic to avoid issue 3. See the Logging Subscriber Requirements. 4. What happens to logging received after the Level 0 table for that day has already been summarized? See the Logging Subscriber and Aging Process Requirements. 5. What is the best time of day to run the summarize process? Schedule after logging for day is usually complete but before automated report runs begin. 6. How will the switch from daylight savings to standard time be handled? Would storing datetime in UTC format solve this? Use timestamp data type instead of datetime? Per Mark Scriffiny, the ICE API is passing all times in GMT and they are stored that way in the MySQL database. Per MySQL we are converted from DATETIME data types to TIMESTAMP data types. 7. MySQL security needs to be designed and implemented at the same time that we deploy the new schema and processes. Added security requirements to document. 8. Resynch processes need to be defined and tested to make sure no schema changes are needed. Added resync process requirements to the document. The CLR_IND was added to the table structure to enable easier resynch of data following a failure. 9. Apply best practices to guarantee the integrity of tables during compression. Added to Aging Process Requirements. 10. Schedule table compression to minimize the impact to reporting and ad-hoc query activity. Added to Aging Process Requirements. 11. Consider creating a summary table of intervals dropped from Level 0 due to zero counts. Added to Aging Process Backlog Requirements. 12. Consider creating a table of peak intervals from Level 0 detail. Added to Aging
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
12
Process Backlog Requirements. 13. Consider keeping the INSTANCE_NAME in all levels of summary. Will this cause indexes to be much larger even if the column is left blank? A null varchar value will require 8 bytes of storage. We will keep INSTANCE_NAME in all summary levels. 14. Should we eliminate the RANGE_5_END columns since this will always be equal to the max for that metric? No, if five or fewer levels are specified, the upper limit will be set based on the XML input. 15. What will the aging model be for development, integration and certification test environments? Development: Level 0 Integration: Level 0 – 1 Certification: Level 0 – 3 just like production 16. Test all existing Business Objects reports with the new schema. Added to Test Plan. 17. Will the BIGINT data type be large enough for a whole day sum of squares? Yes, the largest unsigned value for BIGINT is 18,446,744,073,709,551,615. 18. How do we guarantee uniqueness of the primary key through summarization and resynchronization processes? Include an integer prefix in the auto-generated integer primary key containing the name of the table (yyyymmdd) followed by a single digit indicating to which database the row was originally inserted. 19. It was agreed during the initial review of this design with the ICE development team that the Level 0 tables will not be partitioned by data, but will be a single set of 5 tables with a rolling 14 days of detail maintained in each.
Consolidated Logging Repository (CLR) Metrics Aging February, 2008 Confidential and Proprietary Sabre Inc.
13