Analysis of Hotel Pricing Data

I am Karan Jain, pursuing my B.Tech in Manufacturing Process and Automation Engineering (M.P.A.E.) from N.S.I.T, New Delhi. I wish to present my findings for the hotel pricing data set that you generously provided for me to draw relevant insights from. After analyzing the data for a while and considering many variables, I realized how burdensome a task analysis is. I have been committed to writing insights and managerial relevance for the articles that you provide, while absorbing the knowledge you share. The actual time the task takes, and the difficulty of deciding which features to select to arrive at an acceptable model, struck me today when I had to think like a data scientist.
Introduction

Hotel pricing is a complex phenomenon: a myriad of characteristics must be factored in when conducting the analysis and determining the correct price for a particular room. For example, it would be absurd to price a room in a non-metro, non-tourist city at a level that would draw frowns even in a metro or tourist city. The objective of the analysis is therefore to crunch the data for the given 42 cities, some metro and some non-metro and thus covering many such criteria, to yield a model that, given a new city and a hotel with a given set of features, can more or less predict the price of a particular room from these characteristics. The key is to capture the characteristics that matter and leave the irrelevant ones at the margin.
Describing the Dataset

The data set provided involves the following variables:

Dependent Variable

DECISION VARIABLE: RoomRent
UNITS: Rupees
MEANING: Rent for the cheapest room, double occupancy, in Indian Rupees. Some hotels have more than one type of double occupancy room. For simplicity, we picked the cheapest room with double occupancy.
External Factors

Many external factors can potentially influence the RoomRent. The dataset captures some of these external factors, as explained below.

VARIABLE (UNITS): MEANING
Date (Text): We have hotel room rent data for the following 8 dates for each hotel: {Dec 31, Dec 25, Dec 24, Dec 18, Dec 21, Dec 28, Jan 4, Jan 8}. If a hotel is sold out on a given date, assume that the price of the hotel room on the date it is sold out is the maximum price from the sample of dates for which prices are available.
IsWeekend (Dummy): We use ‘0’ to indicate week days, ‘1’ to indicate weekend dates (Sat / Sun).
IsNewYearEve (Dummy): ‘1’ for Dec 31, ‘0’ otherwise.
CityName (Text): Name of the City where the Hotel is located, e.g. Mumbai.
Population (Number): Population of the City in 2011 (See Table A1 below).
CityRank (Dummy): Rank order of City by Population (e.g. Mumbai = 0, Delhi = 1, so on); (See Table A1).
IsMetroCity (Dummy): ‘1’ if CityName is {Mumbai, Delhi, Kolkatta, Chennai}, ‘0’ otherwise.
IsTouristDestination (Dummy): We use ‘1’ if the city is primarily a tourist destination, ‘0’ otherwise. For example, Goa and Agra are primarily tourist destinations. We assume that most people who visit Goa and Agra and stay in their hotels are in these cities primarily for tourism.
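The sold-out rule described for the Date variable can be sketched in a few lines. This is only an illustration in Python, not the procedure actually used to prepare the dataset; the function name `fill_sold_out` and the sample rents are hypothetical, and a sold-out date is assumed to be recorded as `None`.

```python
from typing import Dict, Optional


def fill_sold_out(prices: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Replace each sold-out (None) price with the hotel's maximum observed price."""
    observed = [p for p in prices.values() if p is not None]
    if not observed:
        # The rule needs at least one date with a known price.
        raise ValueError("no observed prices for this hotel")
    ceiling = max(observed)
    return {date: (ceiling if p is None else p) for date, p in prices.items()}


# Hypothetical rents for one hotel; None marks a sold-out date.
rents = {"Dec 24": 5200.0, "Dec 25": 6100.0, "Dec 31": None, "Jan 4": 4800.0}
filled = fill_sold_out(rents)
print(filled["Dec 31"])
```

Using the maximum as the stand-in reflects the assumption that a sold-out room would have cost at least as much as the most expensive date on record.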
Internal Factors

Many Hotel Features can influence the RoomRent. The dataset captures some of these internal factors, as explained below.

VARIABLE (UNITS): MEANING
HotelName (Text): e.g. Park Hyatt Goa Resort and Spa
StarRating (Number): e.g. 5
Airport (km): Distance between Hotel and closest major Airport
HotelAddress (Text): e.g. Arrossim Beach, Cansaulim, Goa
HotelPincode (Number): 403712
HotelDescription (Text): e.g. 5-star beachfront resort with spa, near Arossim Beach
FreeWifi (Dummy): ‘1’ if the hotel offers Free Wifi, ‘0’ otherwise
FreeBreakfast (Dummy): ‘1’ if the hotel offers Free Breakfast, ‘0’ otherwise
HotelCapacity (Number): e.g. 242. (enter ‘0’ if not available)
HasSwimmingPool (Dummy): ‘1’ if they have a swimming pool, ‘0’ otherwise
Getting Started in Interpreting Results to Calculate Hotel Room Prices:

At first, one may find oneself in a dense jungle of unstructured data: a deluge of information on 42 cities, with as many as 12 distinct features to help decide a sound room price, and a wrong choice among them sends all the effort down an erroneous path. I first drew a correlation diagram to get a basic idea of how to frame the variables and how they relate to the fluctuation of room prices. I won't discuss the technicalities of the approach in detail, but I would like to keep the reader in the thick of the developing situation, which visuals often do effectively. Here are the correlation diagrams, which I split into two phases so that one can have a better understanding of the variables.
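As a rough illustration of what sits behind such a correlation diagram, the sketch below computes the Pearson correlation of a few variables with RoomRent. It is written in plain Python and is not the code used for the diagrams in this report; the six rows of data are invented purely for demonstration, and only the column names come from the dataset description.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (spread_x * spread_y)


# Hypothetical sample; these values are NOT taken from the real dataset.
hotels = {
    "StarRating": [3, 4, 5, 3, 5, 4],
    "HasSwimmingPool": [0, 1, 1, 0, 1, 0],
    "Airport": [12.0, 5.5, 30.0, 8.0, 22.0, 15.0],
    "RoomRent": [2500, 4800, 9000, 2200, 11000, 4000],
}

# Correlation of every explanatory column with RoomRent.
corr_with_rent = {
    name: pearson(column, hotels["RoomRent"])
    for name, column in hotels.items()
    if name != "RoomRent"
}
for name, value in corr_with_rent.items():
    print(f"{name}: {value:+.2f}")
```

A value near zero for a column suggests it carries little linear information about RoomRent on its own, which is the kind of signal used to set variables aside.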
With these diagrams you may be able to relate many of the variables to one another, but the correlations with room rent left me flabbergasted: some variables showed no correlation whatsoever despite their logical relevance. What I gathered from the analysis is that some variables which we logically relate to room rent are instead related to occupancy. We tend to think that the higher the occupancy, the cheaper the hotel, which may not be true, since that conclusion holds only under certain conditions. The variables related to occupancy rather than to room price were as follows:

Airport
Free Wifi service
Weekend incidence
Has Breakfast
In layman's terms, these translate to the following constraints: whether the hotel is near an airport or not, whether the hotel provides Wi-Fi service or not, whether the day the room is sold falls on a weekend or not, and whether the hotel provides complimentary breakfast or not. I arrived at the following table of relations, which at first glance gives a preview of the model developed later in this report.

S.no  Variable              Relation with Room Rent
1     Star Rating           +ve
2     Capacity              +ve
3     Swimming Pool         +ve
4     Tourist Destination   +ve
5     New Year Incidence    +ve
6     Metro                 -ve
7     Population            -ve
The table above indicates the direction of each variable's relation with Room Rent.
Linear Model Formulated:

I now had the burden of using the 7 features mentioned above to determine a quality equation that yields the price varying least from the original. But I also wanted to include interactions between variables, to capture the nuanced interrelations and to reduce erroneous relations drawn from variables in isolation.

Case 1) Consider the variables Metro and Tourist Attraction. I can combine these variables by using Tourist Attraction to determine the room price, seasoned with an interaction with the Metro variable as a minor influencing metric.

Case 2) Consider the variables New Year Eve and Tourist Attraction. Following the same line of thought, I would treat New Year Eve as an influencing metric. This is logical, as hotel prices escalate around the new year owing to the sheer demand, even for modest rooms, from the slew of travelers.
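To make the idea of an interaction concrete: for two dummy variables, the interaction is simply their product, so it equals 1 only when both conditions hold at once. The sketch below, in Python, builds the two interactions from Case 1 and Case 2 for a couple of made-up rows. The function name `add_interactions` and the rows are hypothetical; the colon in the new column names borrows R's formula notation.

```python
def add_interactions(row):
    """Return a copy of the row with the two dummy-by-dummy interaction columns added."""
    out = dict(row)
    # Case 1: equals 1 only for a city that is both a metro and a tourist destination.
    out["IsMetroCity:IsTouristDestination"] = (
        row["IsMetroCity"] * row["IsTouristDestination"]
    )
    # Case 2: equals 1 only for a tourist destination on New Year's Eve.
    out["IsTouristDestination:IsNewYearEve"] = (
        row["IsTouristDestination"] * row["IsNewYearEve"]
    )
    return out


# Hypothetical rows, for illustration only.
tourist_city_on_nye = {"IsMetroCity": 0, "IsTouristDestination": 1, "IsNewYearEve": 1}
metro_city_weekday = {"IsMetroCity": 1, "IsTouristDestination": 0, "IsNewYearEve": 0}

first = add_interactions(tourist_city_on_nye)
second = add_interactions(metro_city_weekday)
print(first)
print(second)
```

Because the interaction column is zero unless both dummies are 1, its coefficient measures only the extra effect of the two conditions occurring together.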
Thus my linear model included such interaction variables in addition to the other variables. Here is a glimpse of my model and its summary characteristics:

> summary(model7)

Call:
lm(formula = RoomRent ~ HasSwimmingPool + HotelCapacity + Population +
    (IsMetroCity:IsTouristDestination) + IsTouristDestination +
    StarRating + HasSwimmingPool:IsNewYearEve + IsNewYearEve:IsTouristDestination,
    data = pricingtrain)

Residuals:
    Min      1Q  Median      3Q     Max
 -13653   -2356    -651    1062  309334

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                       -8.345e+03  4.428e+02 -18.845  < 2e-16 ***
HasSwimmingPool                    1.797e+03  1.961e+02   9.163  < 2e-16 ***
HotelCapacity                     -1.113e+01  1.252e+00  -8.893  < 2e-16 ***
Population                        -6.864e-05  3.256e-05  -2.108   0.0350 *
IsTouristDestination               2.307e+03  2.033e+02  11.351  < 2e-16 ***
StarRating                         3.699e+03  1.333e+02  27.749  < 2e-16 ***
IsMetroCity:IsTouristDestination  -1.447e+03  3.474e+02  -4.165 3.14e-05 ***
HasSwimmingPool:IsNewYearEve       1.768e+03  4.471e+02   3.954 7.74e-05 ***
IsTouristDestination:IsNewYearEve  7.727e+02  3.213e+02   2.405   0.0162 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6908 on 9915 degrees of freedom
Multiple R-squared:  0.1795,  Adjusted R-squared:  0.1788
F-statistic: 271.1 on 8 and 9915 DF,  p-value: < 2.2e-16
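To show how the fitted equation turns a hotel's features into a price, here is a small worked example. It is written in Python rather than the R used to fit the model, and it simply plugs the coefficient estimates reported in the summary into the linear formula. The hotel described in `example_hotel`, including its city population of 1,000,000, is hypothetical. Since the adjusted R-squared is 0.1788, the result should be read as a rough indication only.

```python
# Coefficient estimates copied from the model summary.
COEF = {
    "(Intercept)": -8.345e03,
    "HasSwimmingPool": 1.797e03,
    "HotelCapacity": -1.113e01,
    "Population": -6.864e-05,
    "IsTouristDestination": 2.307e03,
    "StarRating": 3.699e03,
    "IsMetroCity:IsTouristDestination": -1.447e03,
    "HasSwimmingPool:IsNewYearEve": 1.768e03,
    "IsTouristDestination:IsNewYearEve": 7.727e02,
}


def predict_room_rent(hotel):
    """Predicted RoomRent in Rupees: intercept plus coefficient-weighted terms."""
    rent = COEF["(Intercept)"]
    for term, beta in COEF.items():
        if term == "(Intercept)":
            continue
        # A term such as "A:B" is an interaction: multiply its component values.
        value = 1.0
        for name in term.split(":"):
            value *= hotel[name]
        rent += beta * value
    return rent


# Hypothetical hotel: 5-star, with pool, 242 rooms, in a non-metro tourist city.
example_hotel = {
    "HasSwimmingPool": 1,
    "HotelCapacity": 242,
    "Population": 1_000_000,
    "IsTouristDestination": 1,
    "StarRating": 5,
    "IsMetroCity": 0,
    "IsNewYearEve": 1,
}

rent_on_new_year_eve = predict_room_rent(example_hotel)
rent_on_other_day = predict_room_rent({**example_hotel, "IsNewYearEve": 0})
print(round(rent_on_new_year_eve, 2), round(rent_on_other_day, 2))
```

The gap between the two predictions is the sum of the two New Year's Eve interaction coefficients, since this hotel has a pool and sits in a tourist destination.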
The ‘7’ appended to the model name reflects the number of times I failed to yield a coherent analysis, failed to substitute a variable with a better one, or ended up adding redundant variables.

Conclusion:

I would like to offer the following statement, derived from the model, to help a hotel manager identify the rooms that can be rated the most expensive. A hotel which has a swimming pool, considerable capacity, and a high star rating for comfort, and which is located in a town that has the amenities of a cosmopolitan city but is mainly a tourist attraction, should regard itself as the wheat separated from the chaff. If the booking date happens to fall on New Year's Eve, the hotel might even turn around a downslide in revenue, if there was one.