Crowdsourcing Construction Activity Analysis from Jobsite Video Streams


Kaijian Liu, S.M.ASCE 1; and Mani Golparvar-Fard, A.M.ASCE 2

Abstract: The advent of affordable jobsite cameras is reshaping the way on-site construction activities are monitored. To facilitate the analysis of large collections of videos, research has focused on addressing the problem of manual workface assessment by recognizing worker and equipment activities using computer-vision algorithms. Despite the explosion of these methods, the ability to automatically recognize and understand worker and equipment activities from videos is still rather limited. The current algorithms require large-scale annotated workface assessment video data to learn models that can deal with the high degree of intraclass variability among activity categories. To address current limitations, this study proposes crowdsourcing the task of workface assessment from jobsite video streams. By introducing an intuitive web-based platform for massive marketplaces such as Amazon Mechanical Turk (AMT) and several automated methods, the intelligence of the crowd is engaged for interpreting jobsite videos. The goal is to overcome the limitations of the current practices of workface assessment and also provide significantly large empirical data sets together with their ground truth that can serve as the basis for developing video-based activity recognition methods. Six extensive experiments have shown that engaging nonexperts on AMT to annotate construction activities in jobsite videos can provide complete and detailed workface assessment results with 85% accuracy. It has been demonstrated that crowdsourcing has the potential to minimize the time needed for workface assessment, provide ground truth for algorithmic developments, and, most importantly, allow on-site professionals to focus their time on the more important task of root-cause analysis and performance improvements. DOI: 10.1061/(ASCE)CO.1943-7862.0001010. © 2015 American Society of Civil Engineers.
Author keywords: Activity analysis; Construction productivity; Video-based monitoring; Workface assessment; Crowdsourcing; Information technologies.

Introduction

On-site operations are among the most important factors that influence the performance of a construction project (Gouett et al. 2011). Timely and accurate productivity information on the labor and equipment involved in on-site operations can bring an immediate awareness of specific issues to construction management. It also empowers them to take prompt corrective actions, thus avoiding costly delays. To streamline the cyclical procedure of measuring and improving direct-work rates, the time proportion of activities devoted to actual construction, the Construction Industry Institute (CII) recently proposed new procedures for conducting activity analysis (CII 2010). Activity analysis offers a plausible solution for monitoring on-site operations and supports root-cause analysis of the issues that adversely affect their productivity. Nevertheless, the current procedure for implementing activity analysis has inefficiencies that prevent a widespread adoption. The limitations are (1) the large scale of the manual

1 Graduate Student, Dept. of Civil and Environmental Engineering, Univ. of Illinois at Urbana-Champaign, Newmark Civil Engineering Laboratory, 205 N. Mathews Ave., Urbana, IL 61801. E-mail: kliu15@illinois.edu
2 Assistant Professor and National Center for Supercomputing Applications (NCSA) Faculty Fellow, Dept. of Civil and Environmental Engineering and Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, Newmark Civil Engineering Laboratory, 205 N. Mathews Ave., Urbana, IL 61801 (corresponding author). E-mail: [email protected]
Note. This manuscript was submitted on October 31, 2014; approved on March 26, 2015; published online on May 29, 2015. Discussion period open until October 29, 2015; separate discussions must be submitted for individual papers. This paper is part of the Journal of Construction Engineering and Management, © ASCE, ISSN 0733-9364/04015035(19)/$25.00.

on-site observations that are needed to guarantee statistically significant workface data; and (2) the necessary visual judgments of the observers, which may produce erroneous data because of the over-productiveness phenomenon exhibited by construction workers under direct observation, the instantaneous reaction of the observers in benchmarking activity categories, the necessary distance limits to construction workers, and finally, observers' bias and fatigue (Khosrowpour et al. 2014b). Current labor-intensive processes take away time from determining the root causes of issues that affect productivity and how productivity improvements can be planned and implemented (CII 2010). To address the limitations of manual workface assessment, a large body of research has focused on methods that lead to automation. These methods range from the application of ultra-wideband (Cheng et al. 2011; Teizer et al. 2007; Giretti et al. 2009), radio frequency identification (RFID) tags (Costin et al. 2012; Zhai et al. 2009), and global positioning system (GPS) sensors (Pradhananga and Teizer 2013; Hildreth et al. 2005) to vision methods using video streams (e.g., Peddi et al. 2009; Teizer and Vela 2009; Rezazadeh Azar et al. 2012). The majority of these methods build on nonvisual sensors and track the location of the workers and equipment. However, without interpreting the activities and purely based on location information, deriving workface data is challenging (Khosrowpour et al. 2014a). For example, for drywall activities, distinguishing between direct work and tool time purely based on location is difficult because during these activities, the location of a worker may not necessarily change. In contrast to location tracking sensors, the growing number of cameras on jobsites and the rich information available in site videos provide a unique opportunity for automated interpretation of productivity data. Nevertheless, computer-vision methods are


not advanced enough to enable detailed assessments from these videos. This is because the methods for detecting and tracking equipment and workers, especially when workers interact with tools (e.g., Brilakis et al. 2011; Park and Brilakis 2012; Memarzadeh et al. 2013), and the methods for interpreting activities from long sequences of videos (e.g., Gong et al. 2011; Golparvar-Fard et al. 2013) are not mature. Beyond the CII-defined activities, the taxonomy of construction activities is also not fully developed to enable visual activity recognition at the operational level. Finally, training and testing the models used in computer-vision methods for activity analysis requires large amounts of empirical data, which are not yet available to the research community. In the absence of efficient video interpretation methods, tedious manual reviewing will still be required to extract productivity information from recorded videos, and this takes away time from the more important task of conducting root-cause analysis. In this paper, a new workface assessment framework is introduced which provides an easy way to collect and interpret accurate labor and equipment activity information from jobsite videos. The idea is simple: the task of workface assessment is crowdsourced from jobsite video streams. By introducing an intuitive web-based platform on Amazon Mechanical Turk (AMT), the intelligence of the crowd is engaged for interpreting jobsite videos. The goal is to overcome the limitations of the current practices of activity analysis and also provide significantly large empirical data sets together with their ground truth that can serve as the basis for developing automated video-based activity recognition methods. Through extensive validation on various parameters of the platform, it is shown that engaging nonexperts on AMT to annotate construction activities in jobsite videos can achieve accurate workface assessment results. In the following sections, the related works are reviewed, the methods developed for the proposed tool are introduced, and experimental results are discussed.

Related Work

Time-lapse photography and videotaping have proven for many years to be very useful means for recording workface activities (Golparvar-Fard et al. 2009). Since the earlier work of Oglesby et al. (1989) until now, many researchers have proposed procedures, guidelines, and also manual and semiautomated methods for interpretation of jobsite video data. Videos have the advantage of being understandable by any visually able person; they provide detailed and dependable information and allow detailed reviews by the analysts and on-site management away from the work sites. In the next section, some of the most relevant works on the topic of video-based workface assessment are first reviewed. Then, a review of the concept of crowdsourcing is provided, followed by research on video annotation tools and existing databases for activity recognition.

Computer-Vision Methods for Video-Based Construction Activity Analysis

Over the past few years, many computer-vision methods have emerged for inferring the activities of workers and equipment from jobsite videos. A reliable method for video-based activity analysis requires two interdependent components: (1) methods for detecting and tracking resources; and (2) procedures for activity recognition. The majority of the previous works have addressed these components as two separate tasks. Brilakis et al. (2011) and Park et al. (2011) applied scale invariant feature transforms to track construction resources in both 2D and 3D scenarios. Teizer and Vela (2009), Gong and Caldas (2009), Rezazadeh Azar and McCabe (2012), and Chi and Caldas (2011) explored construction resource detection

and tracking by using different visual feature representations and learning algorithms. Park and Brilakis (2012) and Memarzadeh et al. (2013) explored a more structured approach by using machine learning techniques and shape and color templates for detection and tracking. Kim and Caldas (2013) also explored joint modeling of worker actions and construction objects as the basis of construction worker activity recognition. A few studies have also focused on end-to-end activity analysis. For example, Gong and Caldas (2009) presented a concrete bucket model trained by boosted cascade simple features to analyze its cyclic operations. Yang et al. (2014) explored a finite-state machine model to classify tower-crane activities into concrete pouring and nonconcrete material movements. Azar and McCabe (2012) introduced a logical framework using computer vision-based techniques to study construction equipment working cycles. Instead of assuming strong priors on the relationships between activities and locations as in the previous works, Gong et al. (2011), Golparvar-Fard et al. (2013), and Escorcia et al. (2012) proposed bag-of-words models with different discriminative and generative classifiers to recognize atomic activities of construction workers and equipment. By recognizing atomic activities, the observed activities are classified in self-contained videos wherein a single resource starts an activity and ends the same activity within the video. Khosrowpour et al. (2014b) is also one of the earliest attempts to recognize the full sequence of activities for multiple workers from red-green-blue-depth (RGB-D) data. RGB-D data collected from depth sensors such as the Microsoft Kinect bypass the challenges in detection and tracking and provide sequences of worker-body skeletons which can be the input to such activity recognition methods. Jaselskis et al. (2015) also proposed an approach for monitoring construction projects in the field by off-site personnel through live video streams. Despite the explosion of these methods, the ability to automatically recognize and understand worker and equipment activities is still limited. Challenges relate to the large existing variability in the execution of construction operations and the lack of formal taxonomies for construction activities in terms of expected worker/equipment roles and sequences of activities. The complexity of the visual stimuli in activity recognition in terms of camera motion, occlusions, background clutter, and viewpoint changes is another existing challenge. Finally and most importantly, there is a lack of data sets together with ground truth for more exhaustive research on automated activity recognition methods.

Crowdsourcing

To overcome the current limitations in the standard task of activity recognition, the computer-vision community has recently initiated several projects to investigate the potential of crowdsourcing. Crowdsourcing refers to the collaborative participation of a crowd of people to help solve a specific problem and typically involves a rewarding mechanism, for example, paying for participation (Howe 2008). In recent years, the Internet has enabled crowdsourcing in a broadened and more dynamic manner. Crowdsourcing from anyone, anywhere, as needed is now common for tasks ranging from image and video processing, information gathering, and data verification to creative tasks such as coding, analytics, and production development (Wightman 2010; Yuen et al. 2011; Shingles and Trichel 2014).
The wide range of crowdsourcing businesses has also promoted the development of specialized platforms:
• simple, microtask-oriented crowdsourcing: Amazon Mechanical Turk and Elance;
• complicated, experience-oriented crowdsourcing: 10EQS and oDesk;


• open-ended, creative crowdsourcing: IdeaConnection and Innocentive; and
• funding, consumption, and contribution crowdsourcing: Indiegogo, Kickstarter, and Wikipedia (Shingles and Trichel 2014).
Among all platforms, Amazon Mechanical Turk, introduced by Amazon in 2005, has gained popularity within the computer-vision community. Traditionally, computer-vision researchers had to manually create substantially large amounts of annotations (i.e., labeled training data). This was always considered a simple but labor-intensive and costly process. However, with the assistance of the AMT marketplace, researchers, known as requesters, only need to post entire annotation tasks in the form of microtasks or human intelligence tasks (HITs) and compensate online users, known as workers or annotators, with predefined payments for the rendered results. Through crowdsourcing, easy human annotation tasks which are extremely difficult or even impossible for computers to perform can be accomplished in a timely manner. This is by leveraging the AMT's massive labor market, which contains hundreds of thousands of annotators who can complete 60% of HITs within 16 h and 80% of HITs within around 64 h (Ipeirotis 2010). The task can also be done at a low price of $5 per hour on the AMT. A reasonable quality can also be achieved. This is because the AMT keeps track of the annotators' performance and uses that as a screening condition to guarantee an appropriate level of quality in the results. Also, many techniques have been developed to assist requesters with getting high-quality annotations. Sorokin and Forsyth (2008) were the first within the computer-vision community to present a data annotation framework to quickly obtain inexpensive project-specific annotations through the AMT crowdsourcing platform. The proposed framework revolutionized large-scale static image annotation (Vondrick et al. 2013). The subsequent efforts to seek the value of massive data sets of labeled images promoted the design of efficient visual annotation tools and databases. Russell et al. (2008) introduced LabelMe as a web-based annotation tool that supports dense polygon labeling on static images. Deng et al. (2009) presented a crowdsourcing image annotation platform based on ImageNet, an image database of over 11 million images. Everingham et al. (2010) described a high-quality image collection strategy for the PASCAL visual object classes (VOC) challenge.

Crowdsourcing Video Annotations

Despite the success of crowdsourcing annotations for still imagery, the dynamic nature of videos makes their annotation more challenging (Vondrick et al. 2013). Video annotation requires cost-aware and efficient methods instead of frame-by-frame labeling (Vondrick et al. 2013; Wah 2006). Earlier works such as the ViPER-GT tool (Doermann and Mihalcik 2000) gathered ground-truth video data without any intelligent method for assisting the annotation task (Di Salvo et al. 2013). Obviously, the significant number of frames in a video requires smarter ways for propagating annotations from a subset of keyframes; otherwise, a method will not be scalable. More recently, Yuen et al. (2009) introduced LabelMe Video, which employs linear interpolation with a constant 3D-velocity assumption to propagate nonkeyframe spatial annotations in a video. Ali et al. (2011) presented FlowBoost, a tool that can annotate videos from a sparse set of keyframe annotations. Kavasidis et al.
(2012) proposed GTTool to support automatic contour extraction, object detection, and tracking to assist annotations across video frame sequences. Vondrick et al. (2013) presented a video annotation tool, VATIC, to address nonlinear motions in video. Beyond the spatial aspects of video annotation, researchers have recently also focused on extracting temporal information

for video annotation. Gadgil et al. (2014) presented a web-based video annotation system to analyze real-time surveillance videos by annotating each video event's time interval for forensic analysis. Heilbron and Niebles (2014) introduced an automatic video retrieval system to annotate the time intervals of videos that contain the activities of interest. Kim et al. (2014) presented ToolScape for crowdsourcing temporal annotations in videos. Despite the popularity and the benefits of crowdsourcing computer-vision tasks, directly applying it to the task of video-based construction workface assessment can be challenging. Daily site videos exhibit different numbers of workers and equipment for various operations. Workers continuously interact with different tools, and both workers and equipment exhibit changing body postures even when the same activity is being performed. These issues, beyond the typical challenges in the task of activity recognition, could negatively affect the optimal length of annotation tasks or the number of necessary keyframes for annotation. Also, because of the complexity of construction operations and the lack of formal categorization beyond the CII-defined activities, crowdsourcing video-based workface assessment necessitates a taxonomy and customized data frameworks to describe construction activities. Finally, the reliability of crowdsourcing for video-based construction workface assessment has not been validated. In particular, recruiting nonexpert annotators from a crowdsourcing marketplace such as AMT may negatively affect the quality of the assessment results. Beyond technical challenges, detailed experiments are necessary to examine the capability of nonexperts against expert control groups and also to devise strategies for improving the accuracy of crowdsourcing tasks.

Method

To address the challenges of applying crowdsourcing to the video-based construction workface assessment task, a goal was set on creating a new web-based platform. The nonexpert crowd, and specifically Amazon Mechanical Turk, is heavily relied on to conduct workface assessment. Customized user interfaces have been designed to enable construction productivity data retrieval, visualization, and cross-validation. Exhaustive experiments are designed and conducted to fine-tune parameters of the tool, including the annotation method, annotation frequency, and video length, for crowdsourcing video-based construction workface assessment as individual tasks on the AMT. A compositional structure taxonomy has also been created for construction activities to decode complex construction operations. The performance of the expert and nonexpert annotators for detecting, tracking, and recognizing activities is also exhaustively analyzed. Applying cross-validation methods to improve workface assessment accuracy is also investigated. Particularly, several experiments are conducted to seek the optimal fold number for the cross-validation process. Fig. 1 shows an overview of the workflows involved in leveraging the proposed crowdsourcing platform for workface assessment. A companion video of this manuscript also shows various functionalities. In the following sections, these modules and experiments are presented in more detail.

Collecting Jobsite Videos

Because of the lack of precedent experience on how to design and validate a crowdsourcing platform specific to video-based construction workface assessment, it is necessary to use real-world jobsite videos. This allows careful examination of the platform's performance throughout the entire design and validation process. Videos chosen for collection focus on concrete placement operations, which contain a range of visually complex activities and are common on almost all projects. This provides various validation


Fig. 1. Workflow in the crowdsourcing workface assessment tool (images by authors)

scenarios for examining the application of the workface assessment tool. The collection of videos follows two principles: (1) the videos cover various jobsite and video-recording conditions; and (2) the videos exhibit different levels of difficulty for activity annotations. In this research, activity difficulty is defined as the physical and reasoning effort that is required to complete video-based workface assessment using the workface assessment tool. To quantify the difficulty level in video annotations, the following criteria are introduced:
• construction activity conditions: the size of the construction crew, and the frequency of changes in the sequence of activities conducted by individual craft workers;
• visibility conditions: different occlusion conditions, illumination conditions, and background clutter; and
• recording conditions: camera viewpoint, distance, and camera-motion conditions.
These criteria help determine the amount of physical effort and reasoning that is needed to perform the assessment tasks. As shown in Table 1, the collected videos exhibit a large range of changes based on these criteria. This allows the capability of conducting construction workface assessment to be thoroughly investigated under different conditions. Real-world videos of concrete placement operations from three different construction sites were collected. For validating the platform, a total of eight jobsite videos (45 min) were chosen. Based on the criteria defined in Table 1, these videos were classified as easy, normal, and hard. Fig. 2 shows example snapshots from these videos and their levels of difficulty. The collected concrete placement videos cover almost all types of direct work,

such as place concrete, erect form, position rebar, etc., and substantial amounts of nondirect work, such as preparatory work, material handling, waiting, etc.

Activity Analysis User Interface

The workface assessment tool has two main interfaces for task management and workface assessment. The task management interface assists requesters (site managers and engineers, or researchers) in managing workface assessment tasks, including task publication and result retrieval. The workface assessment interface provides annotators access to complete video-based workface assessment tasks. As shown in Fig. 1, first, the requesters use the task management interface to break an entire video of construction operations into several human intelligence tasks (HITs) and publish them online or offline. Online mode allows the AMT annotators to complete HITs, whereas offline mode makes HITs only accessible to a predefined group of users, which can, for example, include expert annotators invited by the requesters. Before publishing the videos for annotation, for privacy purposes, the videos can be processed using image analysis methods to automatically blur human faces when needed. Once logged in, the annotators can then accept published HITs to generate workface assessment results using the workface assessment interface. When all HITs belonging to the same video are completed, the requesters can then retrieve, visualize, and cross-validate the assessment results. They can also generate formal assessment reports through the task management interface. To access the task management interface, a requester should provide a valid username and password, as shown in Fig. 3. This verification step can address copyright and privacy issues of

Table 1. Various Criteria Used to Select the Candidate Videos for Experiments

Videos        | Crew size (person) | Changes in activities | Occlusion | Illumination | Background clutter | Viewpoint | Camera distance (m) | Camera motion
Easy videos   | 2                  | Low                   | Rare      | Daylight     | Low                | Level     | 3–15                | None
Easy videos   | 4–5                | Low                   | Rare      | Daylight     | Normal             | Level     | 5–15                | None
Normal videos | 3–5                | Low                   | Rare      | Cloudy       | Low                | Level     | 15–25               | Rare
Normal videos | 4–7                | Medium                | Medium    | Daylight     | Medium             | Level     | 3–25                | Rare
Normal videos | 6–9                | High                  | Rare      | Daylight     | Medium             | Level     | 5–25                | Rare
Hard videos   | 8                  | Low                   | Severe    | Sunny        | Varied             | Level     | 5–25                | Severe
Hard videos   | 10–12              | Medium                | Severe    | Daylight     | Severe             | Tilted    | 15–45               | Rare

Note: Crew size and changes in activities are activity conditions; occlusion, illumination, and background clutter are visibility conditions; viewpoint, camera distance, and camera motion are recording conditions.


Fig. 2. Videos exhibit different levels of difficulty for video annotation purposes (images by authors)

any uploaded video and protect authentic requesters from having the videos under their accounts shared with unwanted parties. Once the verification step is passed, the requester can use the following functions associated with the task management interface:
• Video links: This function assists the requesters in managing the progress of the annotators and controlling the quality of their work on the workface assessment HITs. To do so, the HITs are presented by unique names and hyperlinks and are also classified into two categories: published and completed.
• Video upload: The tool is prototyped to simultaneously support crowdsourcing workface assessment and the collection of large-scale ground-truth data sets for both academia and industry. Thus, allowing requesters to upload their own videos is integral to the design of the task management interface. To achieve this,

video upload is designed, as shown in Fig. 4, as a function for the requesters to upload their videos and also automatically break them into several HITs based on a desired length. This function also associates the most frequently used labels with the HITs and then publishes them online or offline. The most frequently used labels are compositional structure taxonomies that describe the activities of the workers or equipment in a video.
• Crossvalidation and accuracy: The online mode of the tool is prototyped to leverage the knowledge of the nonexpert annotators from the AMT marketplace. To avoid using inaccurate results produced by these annotators, quality assurance and control should be applied at both the preassessment and postassessment steps. To do so, crossvalidation and accuracy are introduced as functions that report the accuracy of the completed assessments

Fig. 3. Task management interface: (a) log in; (b) video links


Fig. 4. Task management interface: (a) video upload; (b) crossvalidation and accuracy

against ground truth at the postassessment step. They also allow the requester to cross-validate different completed assessments for the same video to achieve a more accurate assessment result. Fig. 4 presents the interface for these functions.
• Video visualization: To retrieve activity analysis information at any desired level of granularity, video visualization presents the workface assessment results in the form of annotated videos, crew-balance charts, or pie charts. The annotated videos, shown in Fig. 5, contain worker/equipment trajectories and activity type information, which are both annotated on the videos. The experts can monitor specific operations and conduct root-cause analysis for hidden issues that may not easily be revealed by other visualization forms. The crew-balance charts represent the time series of worker activities. Unlike annotated videos, the time series provide a depiction of the construction operations in a concise manner. Finally, the pie charts characterize the percentage of time spent on various activity categories, based on the CII (2010) taxonomy. This method is also capable of accurately examining the hourly average percentages for the overall jobsite, which can provide significant benefit to activity analysis.
The workface assessment interface is where the annotators work on the assigned HITs. It is the most important component of the tool, as it is where the actual workface assessment happens. To achieve simplicity and functionality in the design of the interface, structured rules are followed. The interface consists of three main components: (1) the video player, (2) the label drop-down list, and (3) assisting functions, shown in Fig. 6. The player, which takes the most space, shows the video content that requires assessment. The drop-down list contains labels that are presented hierarchically to describe the activities of each worker or equipment. Assisting

functions are located above or under the video player and label drop-down list. The main functions include
• Workface assessment function: Using this function, in each assigned HIT, the annotators can annotate the labels necessary for the activities of the workers and equipment and their body postures. This involves (1) associating a role with each new construction worker/equipment, (2) drawing bounding boxes to localize the worker/equipment in 2D video frames, and (3) selecting labels to describe their activity types, body postures, and tools. As the video proceeds, this procedure continues with updating the position of the bounding boxes and their associated labels. To create new roles for the workers, the annotators can use the +NewResource button, which brings out a list of existing roles. After selecting a role, the cursor is activated to allow the annotator to draw a bounding box around the construction worker or equipment and pinpoint the location in 2D. Next, the annotators select labels from the drop-down list to describe the observed activities. The Play and Rewind buttons together with the video progress control bar allow the role/activity/posture labels to be updated for the observed workers and equipment in a video. Upon completion, the annotator can save the annotation results by pressing the SaveWork button.
• Assisting function: This function is designed to help annotators generate accurate workface assessment results. It consists of Introductions, +NewLabel, and Options. Introductions are provided to help the annotators understand how the assigned tasks can be performed. For efficiency and conciseness, the drop-down list only contains the most frequently used labels, which allows the annotators to quickly find the labels of interest. The platform does not require a comprehensive list of labels to begin with. Rather, the annotators can use +NewLabel


Fig. 5. Video visualization in the task management interface: (a) annotated video; (b) crew-balance chart; (c) detailed and CII-type pie chart; the annotations in the upper left corner of each box show the role-activity-tool-body posture of each worker (images by authors)

to insert the necessary new labels (or customize existing ones) to complete their assessments. The Options function enables annotators to adjust the video player and fine-tune the monitoring settings. These settings include different video speeds, hiding/showing bounding boxes and labels on the videos, and enabling/disabling resizing of the bounding boxes. To keep the user interface constrained and simple, all assisting functions are displayed in different windows which can be triggered by their corresponding functional buttons.

Compositional Structure Taxonomy for Construction Activities

Activity analysis requires a detailed description of construction worker and equipment activities. A sequence of different activities allows the analysis of the root causes for a low productivity rate and also the planning and implementation of productivity improvements. However, construction activities are complex to describe. They exhibit a large amount of interclass and intraclass variability among different activities that can be associated with the roles of the workers and equipment. Because of the dynamic nature of construction operations, the temporal sequence of these activities changes frequently as well. Without a systematic description, it will be difficult to provide accurate activity analysis information. To address the current limitation, the CII (2010) proposed a new taxonomy that classifies all activities into seven categories of direct work, preparatory work,

tools and equipment, material handling, waiting, travel, and personal. Although this taxonomy is generally applicable to all construction operations, it does not provide the detailed description of tasks that would be necessary for the development of visual activity-recognition algorithms. Any vision-based method requires distinct visual features of worker and equipment activities, and that can only be achieved if different labels are used to describe each group of similar visual features. In this research, a compositional structure taxonomy is introduced to decode complex construction activities in the following format: worker role is conducting CII activity category in the form of a visual activity category using tool, body posture and is visible, occluded, or outside of the video frame. As a starting point, the worker type layer contains 19 different roles for construction workers, such as concrete finisher, carpenter, electrician, bricklayer, etc. The second layer is the CII activity category, which describes worker activities in the form of direct and nondirect work (preparatory work, tools and equipment, material handling, waiting, travel, and personal activity). The third layer is the visual activity category, introduced to provide detailed information on activities, tools, and body postures for direct work activities. Tools vary based on different types of activities; nevertheless, they can be an important visual indicator for training vision-based algorithms. Because of the large number of tools involved in any given activity, the interface is designed such that it provides illustrative images of each tool to help annotators identify them easily. Posture can also indicate an activity


Fig. 6. Workface assessment interface (images by authors)

has changed, and thus it can be beneficial for the training of the vision-based algorithms. This visual activity category layer, with its detailed activity-tool-body posture representation, can enable extraction of proper visual features and devising of appropriate computer-vision methods. The method also describes nondirect work with worker body postures to enable a better synthesis of nondirect work activities. Finally, the severe occlusions and background clutter in jobsite videos can create noise in a representative activity data set. Thus, visibility information (occluded or outside of the video frame) is associated with each annotation. Because of space limitations, Fig. 7 presents only a part of the proposed compositional structure taxonomy of construction worker activities related to concrete placement operations. A more detailed representation is available at http://activityanalysis.cee.illinois.edu/. Using the proposed taxonomy, construction professionals can analyze both CII-level and task-level activities for root-cause productivity analysis purposes. To better plan productivity improvements, the interactions between task-level activities and the construction workers' postures and tools are introduced. Not only are these interactions meaningful for the development of robust vision-based algorithms, but they also enable construction professionals to analyze the relationship between tool utilization and direct work rate, which could lead to productivity improvements through better tool utilization. The designed platform is also very flexible. For example, for quick assessments, the requester can choose the role-CII activity structure to annotate construction videos based on the CII taxonomy of activities. For detailed workface assessment and collecting data sets for the development of computer-vision methods, the requester can choose to leverage the worker-activity-posture-tool-visibility compositional structure taxonomy of construction activities. To facilitate the annotation process and minimize the time required to find a specific activity category in a long list, the platform automatically inserts the most frequently used compositional structure taxonomy during the task publication stage.
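To make the compositional structure concrete, the sketch below shows one possible way to encode a single per-frame annotation under the worker-activity-posture-tool-visibility taxonomy described above. The class and field names are illustrative assumptions for this sketch, not the platform's actual data schema.

```python
from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    VISIBLE = "visible"
    OCCLUDED = "occluded"
    OUTSIDE_FRAME = "outside of video frame"


@dataclass
class ActivityAnnotation:
    """One worker/equipment annotation at a single video frame (illustrative schema)."""
    worker_role: str        # e.g., "concrete finisher" (one of the 19 worker roles)
    cii_category: str       # e.g., "direct work", "material handling", "waiting", ...
    visual_activity: str    # e.g., "spreading, leveling and smoothing concrete"
    tool: str               # e.g., "bull float"; may be empty for nondirect work
    body_posture: str       # "bending", "sitting", or "standing"
    visibility: Visibility
    bounding_box: tuple     # (x_min, x_max, y_min, y_max) in pixel coordinates


# Example label produced while annotating a concrete placement video.
label = ActivityAnnotation(
    worker_role="concrete finisher",
    cii_category="direct work",
    visual_activity="spreading, leveling and smoothing concrete",
    tool="bull float",
    body_posture="bending",
    visibility=Visibility.VISIBLE,
    bounding_box=(120, 210, 180, 340),
)
```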

To cope with the needs of different types of construction operations, it also enables the annotators to add missing or new taxonomies on the fly.

Extrapolating Annotations from Keyframes

The dynamic nature of videos makes frame-by-frame annotation necessary but labor-intensive and costly. Crowdsourcing can reduce the human effort, time, and cost of workface assessment; nevertheless, video annotation still needs strategies to propagate assessment results from a sparse set of keyframes. Keyframes are frames in a video sequence that benchmark the start and end of a construction activity (or a change in role). In the platform, these changes need to be captured manually by the annotators. The nonkeyframes are the following frames that contain the same construction activities as the previous keyframe, although the position of the workers or equipment performing the activity may have changed. In this section, the extrapolation methods that are developed to support propagation of the annotations from the keyframes to the nonkeyframes are described. Inspired by Vondrick et al. (2013), linear and detection-based extrapolation methods are implemented. T is defined as the total number of frames, where T = time × (frames per second), and B [Eq. (1)] is defined as the 2D pixel coordinates of each annotation bounding box

$$B = [x_{\min}, x_{\max}, y_{\min}, y_{\max}] \qquad (1)$$

where (x_min, y_max) denotes the coordinates of the upper-left corner and (x_max, y_min) denotes the coordinates of the lower-right corner of the bounding box. B_t (0 ≤ t ≤ T) is defined as the bounding box coordinates at time t. Fig. 8 shows examples of applying the extrapolation methods to generate nonkeyframe annotations (B_t) from known keyframe annotations (B_0 and B_T). Linear extrapolation assumes that workers and equipment that have constant velocity in 3D will also keep their velocity


Fig. 7. Compositional structure taxonomy of construction worker activities (excerpt): each worker type is decomposed into a CII activity category, a visual activity category (atomic activities), a tool, and a body posture.
• Concrete finisher, direct work: concrete placement; spreading, leveling, and smoothing concrete; covering or protecting concrete; curing; molding expansion joint and edge; surfacing; cutting concrete. Tools: bucket, scoop, shovel, bull float, concrete spreader, concrete tamper, hand float, hand screed, hand trowel, power screed, power trowel, vibrator, grout pump, power sprayer, concrete edger, concrete groover, straightedge, broom, concrete brush, concrete saw, line tool. Body posture: bending, sitting, standing.
• Carpenter, direct work: erecting/dismantling scaffold; erecting/stripping formwork; erecting/dismantling temporary structure; installing doors and finishes; assisting concrete pouring. Tools: level, plumb rule, hammer, power saw, air riveter, claw hammer, chisel, sander, square, tape measure, pneumatic nail gun, bucket. Body posture: bending, sitting, standing.
• Ironworker, direct work: positioning rebar; tying rebar; cutting rebar. Tools: rebar hickey, rod bending machine, hand tying tool, plier, power tying tool, power saw, metal shear, hacksaw, bar cutter. Body posture: bending, sitting, standing.
• Nondirect work activity categories (all worker types): preparatory work, material handling, tools and equipment, waiting, travel, personal; annotated with body posture (bending, sitting, standing).


Fig. 8. Extrapolating nonkeyframe annotations from keyframes (images by authors)

unchanged in 2D. Of course, similar to Yuen et al. (2009) in LabelMe Video, a homography-preserving shape interpolation method can be applied to rectify linear extrapolation. However, anecdotal observations indicate that most construction videos exhibit limited perspective effects and depth variation, so that the 2D velocity follows the constant 3D velocity, and thus linear extrapolation is applied. Assuming constant 2D velocity in both the x and y directions, if a point in the x direction is at $B_0(x_{\min})$ at time 0 and at $B_T(x_{\min})$ at time T, then $B_t(x_{\min})$ follows

$$B_t(x_{\min}) = (T - t) \times \frac{B_T(x_{\min}) - B_0(x_{\min})}{T - 0} \qquad (2)$$

Applying Eq. (2) to all coordinates in the keyframes' bounding boxes, $B_t$ can be calculated as

$$B_t = (T - t) \times \frac{B_T - B_0}{T} \qquad (3)$$
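For readers implementing a similar propagation step, the following minimal sketch interpolates a bounding box between two keyframe annotations under the constant-2D-velocity assumption behind Eqs. (2) and (3), written in the equivalent form B_t = B_0 + (t/T)(B_T − B_0); the function name and example values are illustrative, not part of the platform.

```python
import numpy as np


def linear_extrapolate(b_0: np.ndarray, b_T: np.ndarray, T: int) -> np.ndarray:
    """Propagate a keyframe box B = [x_min, x_max, y_min, y_max] to every
    nonkeyframe 0 < t < T, assuming constant 2D velocity:
    B_t = B_0 + (t / T) * (B_T - B_0)."""
    t = np.arange(1, T)[:, None]           # frame indices 1 .. T-1, one per row
    return b_0 + (t / T) * (b_T - b_0)     # one interpolated box per nonkeyframe


# Usage: a worker annotated at frame 0 and frame 30 (1 s of video at 30 fps).
b_start = np.array([100.0, 180.0, 160.0, 300.0])   # [x_min, x_max, y_min, y_max]
b_end = np.array([130.0, 210.0, 160.0, 300.0])
boxes = linear_extrapolate(b_start, b_end, T=30)
print(boxes.shape)   # (29, 4): boxes for frames 1 .. 29
```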

To localize the position of the workers and equipment in nonkeyframes, the detection-based extrapolation method treats keyframe annotations as positive samples for training machine-learning classifiers in computer-vision algorithms. To build a proper classifier for detecting workers and equipment, visual features should be properly formed. Shape-based feature descriptors, such as the histogram of oriented gradients (HOG) with or without the histogram of color (HOC), have gained popularity in worker and equipment detection (e.g., Park and Brilakis 2012; Memarzadeh et al. 2013). Thus, visual feature descriptors x_i consisting of HOG and HOC features are built, as shown in the following equation:

$$x_i = \begin{bmatrix} \mathrm{HOG} \\ \mathrm{HSV} \end{bmatrix} \qquad (4)$$

where HOG is computed based on Dalal and Triggs (2005), and HOC is a nine-dimensional feature containing three means and six covariances computed from the hue, saturation, and value color channels. x_i is applied to the keyframe annotations to construct positive samples and to automatically extracted patches from the keyframes' background to construct negative samples. To learn a specific visual classifier that is able to assign high scores to positive samples, the same procedure as in Memarzadeh et al. (2013) is followed, and a binary support vector machine (SVM) is introduced per type of resource, because the SVM is a commonly used discriminative classifier in the machine-learning literature and performs well with carefully created ground-truth data. Each binary SVM classifier is trained by feeding all training samples (x_i, +1)/(x_i, −1) to optimize the following objective function:

$$\arg\min_{w,b,\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i [w^T \phi(x_i) + b] \geq 1 - \xi_i, \quad \xi_i \geq 0, \; i = 1, \ldots, l \qquad (5)$$

where φ(x_i) is a kernel function that maps x_i to a higher-dimensional space, and C is the penalty parameter. The trained visual classifier with weight vector w (the normal vector to the hyperplane that separates positive and negative detection features) will be used to detect construction workers and equipment for nonkeyframe annotation propagation. Because of the presence of frequent occlusions and background clutter on construction sites, it is very difficult to propagate nonkeyframe annotations at a 100% level of accuracy. Therefore, the constrained tracking of Vondrick et al. (2013) is applied to reduce the error in the detection of the workers and equipment. Constrained tracking finds the best candidate from all possible detections in each frame to constitute a path with minimum cost. The path with minimum score is defined as B_{0:T} = B_0, B_1, …, B_{T−1}, B_T, where B_0 and B_T are the manually generated keyframe annotations, and B_1, …, B_{T−1} are automatically generated from the trained SVM visual classifier. The optimization problem is then defined as

$$\arg\min_{B_{1:T-1}} \sum_{t=1}^{T-1} U_t(B_t) + P(B_t, B_{t-1}) \qquad (6)$$

where the unary cost U_t(B_t) is defined by Eq. (7), and the pairwise cost P(B_t, B_{t−1}) is defined by Eq. (8)

$$U_t(B_t) = \min[-w \cdot \phi(B_t), \alpha_1] + \alpha_2 \| B_t - B_t^{\mathrm{lin}} \|^2 \qquad (7)$$

$$P(B_t, B_{t-1}) = \alpha_3 \| B_t - B_{t-1} \|^2 \qquad (8)$$

The unary cost U_t(B_t) calculates the cost of the potential detection in each frame from the score of the visual classifier and the l2-norm of the difference between the bounding box of the SVM detection and that of the linear extrapolation. The SVM associates the most probable prediction with the highest score, minimizing the cost for the most likely detection. In this research, −w · φ(B_t) is used as the score of the visual classifier. Because of the presence of occlusions, some video frames t may not contain a ground-truth detection. These frames cause false negatives with small scores to become the potential B_t. In such situations, the annotation of the nonkeyframes relies on the linear extrapolation method, and the classifier score −w · φ(B_t) is replaced with a very small (near-zero) value α_1. The pairwise cost calculates the smoothness of the detection path for each worker/equipment. In this research, the position of


the bounding box does not change if the camera motion is minimal. Thus, a true path should have the minimum pairwise cost among all possible candidates. This pairwise cost has been adopted as a gauge to test and select the best candidates for the path in each frame.
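Because the objective in Eq. (6) decomposes into unary and pairwise terms, the minimum-cost path can be found exactly with dynamic programming over per-frame candidate detections. The sketch below illustrates that idea under stated assumptions: candidate boxes and classifier scores are taken as inputs (the SVM detector itself is not shown), and the function name and α values are illustrative rather than the platform's implementation.

```python
import numpy as np


def track_min_cost_path(candidates, scores, lin_boxes, alpha1=0.0, alpha2=1e-3, alpha3=1e-3):
    """Select one bounding box per frame minimizing sum_t U_t(B_t) + P(B_t, B_{t-1}).

    candidates: list over nonkeyframes t = 1..T-1; each entry is an (N_t, 4) NumPy
                array of candidate boxes [x_min, x_max, y_min, y_max].
    scores:     list of length-N_t arrays with classifier scores w . phi(B_t).
    lin_boxes:  (T-1, 4) array of linearly extrapolated boxes B_t^lin, aligned with candidates.
    """
    # Unary cost per frame: U_t(B) = min(-score, alpha1) + alpha2 * ||B - B_t^lin||^2
    unary = [np.minimum(-s, alpha1) + alpha2 * np.sum((c - lin) ** 2, axis=1)
             for c, s, lin in zip(candidates, scores, lin_boxes)]

    # Forward pass (Viterbi): best accumulated cost for ending at each candidate.
    cost = unary[0]
    back = []
    for t in range(1, len(candidates)):
        pair = alpha3 * np.sum(
            (candidates[t][:, None, :] - candidates[t - 1][None, :, :]) ** 2, axis=2)
        total = cost[None, :] + pair                 # shape (N_t, N_{t-1})
        back.append(np.argmin(total, axis=1))        # best predecessor per candidate
        cost = unary[t] + np.min(total, axis=1)

    # Backward pass: recover the minimum-cost sequence of boxes.
    idx = int(np.argmin(cost))
    path = [idx]
    for bp in reversed(back):
        idx = int(bp[idx])
        path.append(idx)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]
```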


Annotating Multiple Workers and Equipment

Workface assessment videos always contain a few crew members and possibly a piece of equipment. Thus, an efficient annotation method is imperative to reduce time and guarantee quality for dense annotations. Three annotation methods are introduced, categorized as one-by-one, all-at-once, and role-at-once. To explain each method, each task is defined as annotating a video with T frames and N construction workers/equipment. Using the all-at-once method, the annotators annotate or update the labels of all N workers and equipment in frame t (0 ≤ t ≤ T) simultaneously. This requires the annotators to watch the entire video prior to conducting the annotations. Using the one-by-one method, annotators annotate or update one worker or equipment for all T frames and then rewind the video to start on the next worker or equipment until all N workers and equipment are annotated. This method requires the annotators to watch the entirety of the video N times. Finally, the role-at-once method assumes there are M different roles. It then requires the annotators to annotate or update the labels of all workers and equipment with the same type of role for all T frames and rewind the video to start the next group of workers/equipment until all M role types are annotated. This method requires watching the video M times. Previous works show that annotators on AMT tend to select the all-at-once approach as their primary annotation method. Compared with the other methods, the all-at-once method may seem to save time by only requiring the annotator to watch the video once. However, this choice is not optimal for all conditions. For example, when annotating all workers at once, the annotators may lose track of each specific construction worker. Reviewing and correcting such labels would require additional time. In this research, focusing only on one worker/equipment may increase the familiarity of the annotator and ultimately save time during the annotation updating process. One should also note that time is not the only indicator that needs attention. The trade-off between time and accuracy could also affect the final annotation results. To quantify the time, the accuracy, and their trade-off relationship in real application scenarios, experiments are conducted to examine the most efficient annotation approach. The results will be discussed in the "Experimental Results" section.

Quality Assurance and Quality Control Methods for AMT

The AMT is a marketplace of hundreds of thousands of annotators for solving microtasks (HITs) quickly and effortlessly. However, because of poor or malicious judgment by the annotators, quick assessments may lead to erroneous results. To lower the risk of obtaining low-quality results, AMT annotators are classified into the following: (1) skilled annotators who possess the ability to provide accurate workface assessment results; (2) ethical annotators who are honest but may be incapable of providing results with high accuracy because of poor judgment; and (3) unethical annotators who only try to finish as many tasks as possible in a random fashion just to be able to earn money. To reject the unethical annotators and improve the accuracy of the skilled and ethical annotators, preassessment and postassessment quality control steps are designed as follows:
• Preassessment: This step is used to select skilled and ethical annotators and reject unethical annotators from the AMT marketplace by testing their performance. In the preassessment procedure, a short testing video, for which the ground truth has been previously generated, is added to the start of each HIT. At the very first time, the annotators are directed to this testing video. The platform compares these annotations with the ground truth and reports the annotator's accuracy for the given HIT. If the accuracy is above the requester's predefined threshold, the annotator can continue to work on the actual HITs. Otherwise, the annotator is prohibited from working on any other HITs.
• Postassessment: One should note that most if not all annotators are aware of the preassessment screening tests on the AMT marketplace (Ipeirotis et al. 2010; Snow et al. 2008; Le et al. 2010). This provides an opportunity for the unethical annotators to deliver a perfect testing performance to get the pass and then generate random assessments for the actual HITs. Thus, besides quality assurance, quality control is needed to examine the accuracy of the actual assessments and to correct potential errors. This is done through a repeated-labeling approach, which requires a video to be annotated multiple times by several annotators. Sheng et al. (2008) also leveraged repeated labeling to deal with noisy data for assessment quality improvement. To do so, a matching schema is defined that uses a cost matrix to find corresponding annotations across the repeated/multiple assessment results and then applies a majority voting strategy to generate the final assessment. The matching schema randomly selects one annotation result as the reference and feeds the reference and the remaining assessments to the cost equation [Eq. (9)] to generate cost matrices for matching corresponding annotations

$$\mathrm{Cost}_t(N_R, N_I) = \begin{cases} S_t^R(N_R) - S_t^I(N_I), & \text{if } \mathrm{Role}_t^{N_R} \equiv \mathrm{Role}_t^{N_I}, \; t = 0, \ldots, T \\ 2 \times [S_t^R(N_R) - S_t^I(N_I)], & \text{otherwise} \end{cases} \qquad (9)$$

where Cost_t(N_R, N_I) is the cost between the N_R-th annotation from the reference and the N_I-th annotation from the input at frame t (0 ≤ t ≤ T); S_t^R(N_R) − S_t^I(N_I) is the area difference (bounding box overlap) between the N_R-th and N_I-th bounding boxes; and Role_t^{N_R} and Role_t^{N_I} are the annotation labels in comparison. The calculated costs are used to constitute the cost matrix for the reference and input annotation results, which is shown in Fig. 9. Annotations from reference and

input are grouped based on the minimum cost value in the cost matrix. Once all groups have been matched, majority voting is performed on the annotations of the same construction worker/equipment at each frame. For example, groups which have fewer annotations than half of the repeat time are first eliminated. Then, the average bounding box coordinates AVG(B_t^all) of all repeated annotations for the same construction worker/equipment are calculated. Annotation(s) whose


Fig. 9. Cost matrix at frame t: rows list the input annotations (1, 2, …, N_I, N_I+1), columns list the reference annotations (1, 2, …, N_R, N_R+1), and each cell contains Cost_t(N_R, N_I) computed with Eq. (9)

sum of |B_t − AVG(B_t^all)| is greater than the defined threshold at frame t are eliminated. Finally, the average of the bounding box coordinates is recalculated, and the majority label for each level of the compositional structure taxonomy is selected from the remaining annotations to generate the final annotation for each construction worker/equipment (the annotation group) at frame t. Although repeated labeling improves the quality of workface assessment, unnecessary repeated labelings cost extra money and time. To save money and time, it is necessary to find an optimal repeat time. Thus, experiments are conducted using the cross-validation method to examine how the accuracy changes based on different repeat times, which are discussed in the following section.

Designing Microtasks from Construction Videos

A video of a construction operation is typically several hours long. This makes assessing it much more difficult than completing a typical short-length microtask on the AMT. To make crowdsourcing feasible, an entire video is broken down into several shorter HITs. For effectiveness, the following is considered:
• The length of a HIT: The longer a HIT is, the less time the annotator will need to complete the task of annotation for the entirety of the video. This is because the annotator will spend less time on understanding the video content and would not need to watch a video multiple times. However, longer HITs can lead to tiredness and boredom of the annotators, which can in turn lower the accuracy and increase the annotation time. To study this trade-off relationship, experiments are conducted and the experiment results are presented in the "Experimental Results" section.
• Annotation frequency: Although dense labeling seems desirable for improving workface assessment accuracy, overlabeling may lead to cost and time overruns. Manually annotating a sparse set of keyframes and applying extrapolation to generate annotations for nonkeyframes can provide the same level of accuracy with less time and cost.
• Method of stitching several HITs together for deriving final assessment results: The platform breaks an entire operation video into multiple microtask HITs. To stitch the results of these HITs and derive the final workface assessment result, a one-second overlap is placed between consecutive HITs. Then, the same method used to match repeated annotation results, as shown in Eq. (9), is applied to all the overlapping annotations of the HITs. They are then stitched together to derive the final workface assessment results (see the sketch following this list).
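A compact sketch of the matching and majority-voting logic described above (and reused for stitching the one-second overlaps between HITs) is given below. The cost follows the structure of Eq. (9); the exact form of the area-difference term, the deviation threshold, and all helper names are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def box_area(box):
    """Area of a box [x_min, x_max, y_min, y_max]."""
    return (box[1] - box[0]) * (box[3] - box[2])


def match_cost(ref_box, inp_box, ref_role, inp_role):
    """Eq. (9)-style cost: area difference, doubled when the annotated roles disagree
    (the area-difference form is an assumption based on the paper's wording)."""
    cost = abs(box_area(ref_box) - box_area(inp_box))
    return cost if ref_role == inp_role else 2.0 * cost


def majority_vote(boxes, labels, repeat_time, deviation_threshold=50.0):
    """Fuse repeated annotations of one worker/equipment at one frame."""
    if len(boxes) < repeat_time / 2.0:          # too few repeats: drop the group
        return None
    boxes = np.asarray(boxes, dtype=float)
    avg = boxes.mean(axis=0)
    keep = np.sum(np.abs(boxes - avg), axis=1) <= deviation_threshold
    boxes = boxes[keep]
    labels = [label for label, k in zip(labels, keep) if k]
    if len(labels) == 0:
        return None
    final_box = boxes.mean(axis=0)              # recomputed average box
    final_label = Counter(labels).most_common(1)[0][0]
    return final_box, final_label
```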

Results and Discussion

Experiment Setup and Performance Measures

To identify the key parameters of the platform and validate the effectiveness of crowdsourcing the workface assessment task, a pool of 10+ nonexpert annotators—typically found on AMT—was assembled together with a control group of 5+ expert professionals, who in this experiment are (1) field engineers with experience in productivity assessment; and (2) students who took a course in productivity and have experience in productivity assessment. All expert participants in this experiment have conducted productivity assessment for more than a one-day operation. The easy, normal, and hard videos of concrete placement operations introduced earlier in the paper were leveraged. With these videos, three separate experiments were conducted with the annotators from the controlled group of construction experts to investigate the impact of the different annotation methods, video lengths, and annotation frequencies on the accuracy of the workface assessment results. To validate the hypothesis that crowdsourcing video-based construction workface assessment through the AMT marketplace is a reliable approach, three additional experiments were conducted on the impact of the platform on accuracy by (1) comparing the performance of nonexpert annotators with the controlled group of construction experts; (2) testing the linear and detection-based extrapolation methods and presenting the performance of each against ground truth data generated by the expert annotators; and (3) testing the performance of the postassessment quality control procedure and exploring the number of repeated labelings needed for a desirable level of accuracy through crossvalidation with different randomly selected folds from both expert and nonexpert annotation results.

To compare experiments and choose the parameters that achieve optimal performance, two validation criteria were chosen: (1) the annotation time spent to complete each experiment; and (2) the accuracy of the workface assessment results. Because the platform applies the compositional structure taxonomy of construction activities, three separate discussions on accuracy are presented:
1. Completeness accuracy examines the annotators’ capability to completely annotate all workers in a HIT video without missing one.
2. Bounding box accuracy investigates the accuracy of worker localization. In this case, 50% overlap between the experimental assessment and the ground truth is used as the acceptable threshold for accuracy (a minimal sketch of this criterion follows this list).
3. Tool accuracy examines the annotators’ capability to correctly label construction tools or the nondirect work categories.
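The sketch below shows how criterion 2 could be scored, assuming boxes are given as (x1, y1, x2, y2) and that the overlap is measured as intersection over union; the function names and the simple one-to-many matching are illustrative rather than the platform's actual implementation.

```python
def overlap(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def bounding_box_accuracy(assessed, ground_truth, threshold=0.5):
    """Fraction of ground-truth boxes covered by an assessed box with at
    least `threshold` (here 50%) overlap."""
    hits = sum(any(overlap(gt, box) >= threshold for box in assessed)
               for gt in ground_truth)
    return hits / len(ground_truth) if ground_truth else 1.0

# Example: one annotated worker compared against the expert box
print(bounding_box_accuracy([(12, 40, 88, 200)], [(10, 38, 90, 205)]))
```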

In the following section, each experiment is introduced, and the experimental results are reported.

Experimental Results
Annotation Method

The experiments for choosing the best annotation method include leveraging the one-by-one, all-at-once, and role-at-once methods to annotate the easy, normal, and hard videos. The time durations (in seconds) required for each annotation method across the expert annotator group are shown in Fig. 10.


Table 3. Average Accuracy of Each Annotation Method

Methods   Completeness   Bounding box   Role   Activity   Posture   Tool
AM01      0.95           0.98           0.83   0.80       0.97      0.84
AM02      0.95           0.98           0.88   0.71       0.97      0.83
AM03      0.97           0.98           0.99   0.80       0.96      0.82

Fig. 10. Time spent using each annotation method for easy (0–15 min), normal (15–30 min), and hard (30–45 min) videos

The average and total annotation times are provided in Table 2. From these results, it is observed that when there is a small number of construction workers in a video, the one-by-one annotation method shows the best performance. As the number of workers increases, the all-at-once method performs best. The one-by-one annotation method requires replaying the video for each worker; thus, it takes longer to annotate hard videos. It is also observed that when the activities of each construction worker change frequently, the one-by-one labeling method performs best. This is because the method saves time by allowing annotators to focus on one construction worker at a time, which minimizes the chance of mistakes and the need for unnecessary revisions. In contrast, a high frequency of changes in the workers’ activities overwhelms annotators who use the all-at-once or role-at-once methods. These methods require extra time to be spent on revising activity categories because the annotators may need to rewind the video frequently to make all annotations consistent with one another. When the worker activity categories do not change frequently, the all-at-once method shows the best performance.

The results in Table 3 show that the accuracy of the assessment results for categorizing activities does not change significantly across the different annotation methods. However, among all, the role-at-once method shows the highest accuracy in labeling the worker roles. This is because this method requires the annotators to reason about the role of the workers and consequently increases the accuracy of labeling roles.

Impact of the Video Length

To examine the relationship between video length and the speed of workface assessment, an experiment was conducted using three different video lengths of 10, 30, and 60 s.

The experimental results, as shown in Fig. 11 and Table 4, indicate that increasing the video length can reduce the annotation time. It is observed that videos of 60-s length require the least annotation time. In particular, the savings in annotation time from posting 60-s videos are significant for the easy and normal videos. For hard videos, presenting videos of 60-s length does not save annotation time significantly compared to posting videos with other durations. This is because combining six 10-s videos into one larger 60-s video allows the annotators to become familiar with the video content, which reduces the time needed for interpreting the content multiple times and the need for redrawing the bounding boxes. As shown in Table 5, it is observed that choosing videos with different lengths does not make a significant difference in the accuracy of the assessment results. However, compared with other video lengths, the 60-s videos exhibit lower accuracy for labeling tools (around 10% lower). This suggests that a longer video length could possibly cause tiredness and boredom for the annotators, which lowers the accuracy compared with the annotation of shorter videos.

Frequency of Choosing Keyframes for Annotation

Annotating a sparse set of frames, as opposed to annotating frames one by one, can minimize the annotation time, but it may also negatively impact the accuracy of the assessment results. To explore the trade-off relationship between time and accuracy, three different fixed annotation frequencies of three times, five times, and nine times per minute were experimented with. In other words, the annotators were asked to use only these prefixed numbers of annotations per minute. The annotation times of each frequency are shown in Fig. 12, and the average and total time durations of the annotations are presented in Table 6. The accuracy of each frequency is also provided in Table 7. From these experiments, the hypothesis that sparse annotation can save time is validated. The sparser the annotations are, the less time the annotation requires. The results particularly show that this method significantly shortens the annotation time for easy and normal videos that exhibit a high frequency of changes in activity categories. Nevertheless, the gains are not as obvious for hard videos, which exhibit a low frequency of change in these categories. Although choosing a smaller number of keyframes saves the time required for annotation, it also lowers the accuracy of the assessment results. As shown in Tables 6 and 7, the three times per minute annotation frequency has the lowest average accuracy.

Table 2. Average and Total Annotation Time for Each Annotation Method

                    Videos with different levels of difficulty in the annotation task
Time      Methods   Easy_01   Easy_02   Easy_03   Normal_01   Normal_02   Normal_03   Hard_01   Hard_02   Hard_03
Average   AM01      550       592       534       1,491       1,317       1,359       1,307     1,084     1,254
Average   AM02      363       836       607       1,404       1,276       1,879       871       823       804
Average   AM03      478       922       788       1,736       1,507       1,647       1,174     908       1,576
Total     AM01      —         8,380     —         —           20,841      —           —         18,232    —
Total     AM02      —         9,034     —         —           22,801      —           —         12,495    —
Total     AM03      —         10,943    —         —           24,459      —           —         18,298    —

Note: AM01 = one-by-one; AM02 = all-at-once; AM03 = role-at-once; time is measured in seconds.


Table 5. Average Accuracy in Workface Assessment Results of Each Video Length

Lengths (s)   Completeness   Bounding box   Role   Activity   Posture   Tool
10            0.95           0.98           0.83   0.80       0.97      0.84
30            0.90           0.95           0.88   0.81       0.97      0.83
60            0.90           0.96           0.90   0.82       0.97      0.74

Note: The bold font shows the maximum average accuracy in each category.

Fig. 11. Annotation time for easy (0–15 min), normal (15–30 min), and hard (30–45 min) videos

This, however, does not mean that requesters who prefer higher accuracies in their workface assessment should never guide the annotators toward lower annotation frequencies. Rather, the experimental results indicate that the requesters can conduct a cost-benefit analysis when choosing accuracy versus the time needed for completing the workface assessment tasks. For example, the results show that for 60-s videos, the three times per minute frequency is 66% faster than the other annotation frequencies and results in only a 7% reduction in the accuracy of the workface assessments.

Expert versus Nonexpert Annotator

To validate the reliability of crowdsourcing on the AMT marketplace, an experiment was conducted to compare the annotation time and accuracy of a large pool of nonexperts (10+) against a controlled group of construction experts (5+). Fig. 13 shows the difference in annotation time between the nonexpert and expert annotators; on average, an expert is 22% faster than a nonexpert annotator. The accuracy of annotations between the nonexpert and expert control groups was also compared, and linear regression was used to interpolate all observation points. As shown in Figs. 14 and 15, which present the differences in accuracy for the percentage of completeness of work, the bounding box, body posture, role, activity, and tool, the experts, in general, perform slightly better than the nonexpert groups on the AMT. Although the nonexpert annotators produce higher accuracy in selecting roles, as shown at the top of Fig. 15, the difference is less than 2% and does not constitute a meaningful advantage. Table 8 shows the comparison of the accuracy of the workface assessment results between the nonexpert and expert groups; the difference is only about 3% in favor of the expert group.
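As a tiny sketch of the regression step (hypothetical variable names; not the authors' analysis code), a least-squares line can be fit through each group's accuracy observations and the two fitted lines compared:

```python
import numpy as np

def accuracy_trend(observation_index, accuracy_values):
    """Return (slope, intercept) of the linear fit used to summarize how a
    group's annotation accuracy varies across the observed HITs."""
    slope, intercept = np.polyfit(observation_index, accuracy_values, deg=1)
    return slope, intercept

# expert_points and nonexpert_points would hold the observed per-HIT accuracies
# of each group; comparing the two fitted lines mirrors the trends in Figs. 14 and 15.
```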

Fig. 12. Annotation time of each video annotation frequency: easy (0–15 min), normal (15–30 min), and hard (30–45 min) videos

These encouraging results testify that for video-based workface assessment, nonexpert annotators on the AMT have the potential to perform as well as the expert group.

Crossvalidation

The tool applies repeated labeling as a postassessment quality control step. In this way, the requester can improve the accuracy of the assessments and minimize the risk of collecting inaccurate data. The idea is to crossvalidate the repeated labeling results across multiple AMT annotators to generate final assessment results with potentially higher accuracies. However, unnecessary repetitions lead to additional cost and can potentially increase the assessment time (unless the annotators work simultaneously). Either way, it is necessary to explore how many rounds of labeling are needed to guarantee satisfactory workface assessment results. Experimental results are presented in Figs. 16(a–c). To conduct a systematic experiment, the worst annotation result was chosen from each video category as the original annotation (onefold).

Table 4. Average and Total Time Spent on Annotating Videos with Different Length

                        Videos with different levels of difficulty in the annotation task
Time      Lengths (s)   Easy_01   Easy_02   Easy_03   Normal_01   Normal_02   Normal_03   Hard_01   Hard_02   Hard_03
Average   10            550       592       534       1,492       1,317       1,359       1,307     1,084     1,254
Average   30            488       655       527       1,292       1,230       1,835       662       529       850
Average   60            279       377       354       806         635         1,062       513       436       614
Total     10            —         8,380     —         —           20,841      —           —         18,232    —
Total     30            —         8,349     —         —           21,784      —           —         10,227    —
Total     60            —         5,050     —         —           12,520      —           —         7,812     —

Note: Time is measured in seconds.


Table 6. Average/Total Annotation Time Spent with Each Annotation Frequency

                                    Videos with different levels of difficulty in the annotation task
Time      Frequencies (times/min)   Easy_01   Easy_02   Easy_03   Normal_01   Normal_02   Normal_03   Hard_01   Hard_02   Hard_03
Average   9                         184       410       313       588         450         608         334       260       401
Average   5                         488       655       161       337         336         460         253       182       292
Average   3                         83        125       118       265         189         237         243       198       273
Total     9                         —         4,586     —         —           8,229       —           —         4,975     —
Total     5                         —         2,572     —         —           5,668       —           —         3,630     —
Total     3                         —         1,636     —         —           3,451       —           —         3,575     —

Note: Time is measured in seconds.

Then, annotations from each video category were randomly selected to constitute threefold, fourfold, fivefold, sixfold, sevenfold, and eightfold crossvalidations. For the easy, normal, and hard videos, it is observed that the increase in accuracy tends to level off after a threefold crossvalidation. Figs. 16(a–c) show the results for easy, normal, and hard videos, respectively. Fig. 16(c) also shows that the activity accuracy of hard videos drops from a threefold to an eightfold crossvalidation. The accuracy drop after a sevenfold crossvalidation in Fig. 16(a) may result from unnecessary repeated labelings, which generate more erroneous data and thus cause incorrect labels to stand out. The activity accuracy drop in Fig. 16(c) may also result from unnecessary repeated labeling or from the farther camera distance, which can challenge the accuracy of observations. Based on the average accuracy of each fold of crossvalidation, it was concluded that a threefold crossvalidation provides the optimal performance and that increasing the fold number increases the risk of producing erroneous data.
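As an illustration of this fold experiment, the sketch below, written under assumed data structures rather than the platform's actual code, randomly samples k annotators per video, fuses their per-frame activity labels by majority vote, and scores the fused result against the expert ground truth for each fold size.

```python
# Assumptions: `all_annotations` is a list of per-annotator label sequences
# (one label per frame), and `ground_truth` is the expert sequence of the same length.
import random
from collections import Counter

def fuse(annotations):
    """Majority label per frame across the selected annotators."""
    return [Counter(frame_labels).most_common(1)[0][0]
            for frame_labels in zip(*annotations)]

def accuracy(predicted, ground_truth):
    return sum(p == g for p, g in zip(predicted, ground_truth)) / len(ground_truth)

def fold_curve(all_annotations, ground_truth, max_fold=8, trials=20, seed=0):
    """Average fused accuracy for each fold size (number of annotators per HIT)."""
    rng = random.Random(seed)
    curve = {}
    for k in range(1, min(max_fold, len(all_annotations)) + 1):
        scores = [accuracy(fuse(rng.sample(all_annotations, k)), ground_truth)
                  for _ in range(trials)]
        curve[k] = sum(scores) / len(scores)
    return curve
```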

Linear and Detection-Based Extrapolation Method

To assist AMT annotators in generating nonkeyframe annotations, the tool enables linear and detection-based extrapolation methods. The extrapolation method treats user-assisted keyframe annotations as input and generates nonkeyframe annotations automatically. To examine the performance of each method under different video conditions and annotation frequencies, the two extrapolation methods were tested on easy, normal, and hard videos with different annotation frequencies. The annotation frequency is indicated by the average number of clicks per frame per construction worker, which includes 0.001, 0.005, 0.01, 0.05, and 0.1.

Fig. 14. Percent of completeness, bounding box, and posture accuracy difference between expert and nonexpert annotators

Table 7. Average Accuracy of Assessment Results for Different Annotation Frequencies

Frequencies (times/min)   Completeness   Bounding box   Role   Activity   Posture   Tool
9                         0.94           0.90           0.99   0.78       0.93      0.83
5                         0.87           0.85           0.95   0.74       0.94      0.86
3                         0.83           0.77           0.86   0.66       0.95      0.85

Note: The bold font shows the maximum average accuracy in each category.

Fig. 13. Annotation time difference between expert and nonexpert annotators

Fig. 15. Percent role, activity, and tool accuracy differences between expert and nonexpert annotators


Fig. 16. Crossvalidation results for (a) easy; (b) normal; and (c) hard videos

Table 8. Difference in Accuracies between the Expert and Nonexpert Annotators

Category        Accuracy of an expert when compared with a nonexpert
Completeness    +0.02
Bounding box    +0.01
Role            −0.01
Activity        +0.03
Posture          0.00
Tool            +0.03

Figs. 17(a–c) present the error rates of each method against the ground-truth annotations. Fig. 17 illustrates that an increase in annotation frequency leads to a decrease in error rate for both extrapolation methods. It is observed that increasing the annotation frequency beyond 0.01 clicks per frame per construction worker only marginally reduces the error rate. This experiment indicates that dense annotation does not necessarily guarantee higher annotation accuracy. As indicated by the error rates in Figs. 17(a and b), linear extrapolation performs as well as the detection-based extrapolation method, and the error rate difference is within 5% on average. However, the linear extrapolation method performs much better than the detection-based extrapolation method in Fig. 17(c). This difference is likely caused by the difficulties in extracting effective visual features for hard videos and detecting workers in cluttered construction site conditions.
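To make the linear method concrete, the following is a hedged sketch (assumed box format (x1, y1, x2, y2); function and parameter names are illustrative): between two keyframes the box coordinates are linearly blended, and before the first or after the last keyframe the nearest keyframe box is held constant. The detection-based variant would additionally snap these boxes to worker detections, which is not shown here.

```python
def linear_keyframe_boxes(keyframes, num_frames):
    """keyframes: {frame_index: (x1, y1, x2, y2)}; returns one box per frame."""
    frames = sorted(keyframes)
    boxes = {}
    for t in range(num_frames):
        if t <= frames[0]:                     # before the first keyframe
            boxes[t] = keyframes[frames[0]]
        elif t >= frames[-1]:                  # after the last keyframe
            boxes[t] = keyframes[frames[-1]]
        elif t in keyframes:                   # keyframes are kept as annotated
            boxes[t] = keyframes[t]
        else:                                  # blend the surrounding keyframes
            prv = max(f for f in frames if f < t)
            nxt = min(f for f in frames if f > t)
            w = (t - prv) / (nxt - prv)
            boxes[t] = tuple((1 - w) * a + w * b
                             for a, b in zip(keyframes[prv], keyframes[nxt]))
    return boxes

# Example: keyframes at frames 0 and 30 of a 31-frame clip
print(linear_keyframe_boxes({0: (10, 10, 50, 120), 30: (40, 12, 80, 124)}, 31)[15])
```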

Discussion on the Proposed Method and Research Challenges

The results validate the hypothesis that crowdsourcing construction activity analysis from jobsite videos on the AMT, a marketplace with nonexpert annotators, is a reliable approach for conducting activity analysis. In addition, the platform facilitates the collection of large data sets, with their ground truth, that could be used for the development of computer-vision algorithms for automatic activity recognition. Particularly, it is shown that expert annotators are, on average, 22% faster than nonexpert annotators in terms of their annotation time. However, the accuracy of annotation among the nonexperts is within 3% of the accuracy of the expert group. To fine-tune the platform, the impact of different annotation methods, HIT video lengths, and annotation frequencies was examined and discussed. Based on these experimental results, the following conclusions can be made:
1. The one-by-one annotation method works best with videos that have a small number of construction workers and a high frequency of changes in activities, whereas the all-at-once annotation method works best with videos that have a high number of construction workers and a low frequency of changes in work activities.
2. Increasing the HIT video length can reduce the annotation time. For example, the 60-s videos save 47 and 37% of the annotation time compared to the 10-s and 30-s videos, respectively. It was also observed that the accuracy of the workface assessment results slightly improves with an increase in the HIT video length.
3. Manual annotation of a sparse set of video keyframes is reliable for achieving complete frame-by-frame annotations. In the most extreme case, the three times per minute annotation frequency reduces the average annotation time by 57% while dropping the accuracy of workface assessment by only 7%. The prominent decrease in annotation time at three times per minute could also result in cost savings because AMT charges the requester based on the time the annotator devotes to each HIT.

Fig. 17. The error rates of the linear and detection-based extrapolation methods for annotating nonkeyframes for (a) easy; (b) normal; and (c) hard videos


4. A threefold crossvalidation provides the best accuracy-cost trade-off for workface assessment. Increasing the fold number beyond threefold does not raise accuracy significantly. Also, quality assurance and control steps are important to guarantee the reliability of the assessment results. Repeated labeling can also improve the accuracy of workface assessment, and the optimal performance can be achieved with threefold repetition (i.e., hiring three AMT annotators per HIT).
5. Increasing the number of keyframes (i.e., reducing the sparsity of annotations) does not necessarily lead to a significant increase in the accuracy of 2D localization. The 2D localization accuracy tends to increase as the annotation frequency rises from 0.01 to 0.1 clicks per frame per construction worker; however, the increase is within 18%, and it was observed from the experiments that increasing the keyframe number beyond 0.05 clicks per frame per construction worker barely leads to an increase in accuracy.
6. The linear extrapolation method can perform as well as the detection-based extrapolation method for easy and normal videos. However, because of the difficulties in extracting visual features and detecting construction workers under the severe site conditions of the hard videos, the detection-based extrapolation method fails to compete with the linear extrapolation method.

Conclusion and Future Work

This paper presents a novel method that supports crowdsourcing construction activity analysis from jobsite video streams. The proposed method leverages human intelligence recruited from a massive crowdsourcing marketplace, AMT, together with automated vision-based detection/tracking algorithms to derive timely and reliable construction activity analysis under challenging conditions such as severe occlusion, background clutter, and camera motion. The experimental results, with an average accuracy of 85% in workface assessment tasks, show the promise of the proposed method. The comparisons conducted between nonexperts and construction experts validate the hypothesis that crowdsourcing video-based construction activity analysis through AMT nonexperts can achieve similar (or even the same) accuracy as conducting activity analysis with construction experts. To improve the platform, future work should focus on (1) the design of a more robust detection/tracking algorithm that works well with sparse human input to effectively generate accurate nonkeyframe annotations; and (2) the design of a quality control method that does not require repeated labeling, to reduce requesters’ cost and avoid erroneous data at the voting stage. As part of the study, a new compositional structure taxonomy for construction activities is also created that models the interactions between body posture, activities, and tools. This representation can improve detection/tracking by enhancing the propagation of manual annotations to nonkeyframes. Also, studies that focus on using the hidden Markov model to automatically infer construction activities from long sequences of jobsite videos could be beneficial to the detection/tracking and quality control steps. Learning a set of transition and emission probabilities between each pair of construction activities from a crowdsourcing platform can improve inference on the categories of subsequent activity types for each frame and, in turn, improve the quality control process.

To facilitate frequent implementation of crowdsourced activity analysis, short-term future work involves devising workflows to assist foremen in videotaping their operations on a daily basis. This can be done by placing consumer-level cameras on tripods away from the operation of the crew such that, for the most part, the crew stays within the line of sight of the cameras. For the long term, automatically placing a relocatable network of cameras around the site and leveraging action cameras mounted on the hardhat or the safety vest of the workers can address the issues of conducting data collection throughout the site and the line-of-sight and visibility constraints of each camera. Meanwhile, as part of future work, the following will be considered: (1) plotting the workface assessment results over the course of a day to better understand how soon crafts get on their tools in the morning, where there are excessive breaks, and where and when crews quit an operation too early; and (2) integrating observations from different viewpoints. Finally, to comprehensively validate this new method for construction video-based analysis, a set of detailed crowdsourcing market investigations and experiments should be conducted, not only to test the technical parameters, but also to build a process model that assesses the cost associated with crowdsourcing, the time span between publishing and retrieving tasks, and the potential risks to worker privacy from outsourcing construction video annotations containing construction workers to the crowd. This platform is publicly accessible at http://activityanalysis.cee.illinois.edu. A video is also provided as a companion for better illustration of the functionalities of this platform.
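As a rough illustration of this future-work direction (not a component of the current platform), the sketch below estimates an activity transition matrix from crowd-labeled sequences and uses Viterbi decoding to smooth a noisy per-frame label sequence; the emission model and all numeric values are simplifying assumptions.

```python
import numpy as np

def transition_matrix(sequences, states):
    """Row-normalized transition counts estimated from labeled frame sequences."""
    idx = {s: i for i, s in enumerate(states)}
    counts = np.ones((len(states), len(states)))            # Laplace smoothing
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[idx[a], idx[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def viterbi_smooth(observed, states, trans, p_stay=0.8):
    """Most likely state sequence, assuming each observed label is emitted
    with probability p_stay and any other label uniformly otherwise."""
    idx = {s: i for i, s in enumerate(states)}
    n, m = len(observed), len(states)
    emis = np.full((n, m), (1.0 - p_stay) / max(m - 1, 1))
    for t, s in enumerate(observed):
        emis[t, idx[s]] = p_stay
    log_t, log_e = np.log(trans), np.log(emis)
    score = np.zeros((n, m))
    back = np.zeros((n, m), dtype=int)
    score[0] = np.log(1.0 / m) + log_e[0]
    for t in range(1, n):
        candidates = score[t - 1][:, None] + log_t           # shape (m, m)
        back[t] = candidates.argmax(axis=0)
        score[t] = candidates.max(axis=0) + log_e[t]
    path = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[i] for i in reversed(path)]
```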

Acknowledgments

The authors would like to thank Zachry Construction Corporation and Holder Construction Group for their support with data collection. The authors thank Professor Carl Haas for his very constructive feedback during the development of the workface assessment platform. The technical support of Deepak Neralla with the development of the web-based tool is appreciated. The authors also thank the real-time and automated monitoring and control (RAAMAC) lab’s members, the graduate and undergraduate civil engineering students, and the other AMT nonexpert annotators for their support. This work was financially supported by the University of Illinois Department of Civil and Environmental Engineering’s Innovation Grant. The views and opinions expressed in this paper are those of the authors and do not represent the views of the individuals or entities mentioned above.


Supplemental Data

A video demonstration of the RAAMAC Crowdsourcing Workface Assessment Tool is available online in the ASCE Library (www.ascelibrary.org).

