Unified Code Count (UCC)

Unified Code Count (UCC) is a comprehensive source lines of code counter produced by the USC Center for Systems and Software Engineering. It is available to the general public as open source code and can be compiled with any standard ANSI C++ compiler.

Unified Code Count
[Figure: USC Unified CodeCount (UCC) v.201007. In-development GUI tool from the developer shown (currently unreleased).]
Original author(s): Vu Nguyen
Developer(s): USC CSSE
Initial release: 2009
Written in: C++
Operating system: Cross-platform
Available in: English
Type: File comparison tool
License: USC-CSSE Limited Public License
Website: sunset.usc.edu/research/CODECOUNT/

Introduction

One of the major problems in software estimation is sizing, which is also one of the most important attributes of a software product. Size is not only the key indicator of software cost and time but also a base unit from which other metrics for project status and software quality are derived. The size metric is an essential input for most cost estimation models,[1] such as COCOMO, SLIM, SEER-SEM, and PRICE TruePlanning for Software. Although source lines of code (SLOC) is a widely accepted sizing metric, there is a general lack of standards that enforce consistency in what to count and how to count it.

The Center for Systems and Software Engineering (CSSE) at the University of Southern California has developed and released a code counting toolset called the Unified CodeCount (UCC), which ensures consistency across independent organizations in the rules used to count software lines of code.

Block Diagram of UCC

The primary purpose of UCC is to support software sizing counts and metrics for historical data collection and reporting. It implements a code counting framework published by the Software Engineering Institute (SEI) and adapted by COCOMO. Logical and physical SLOC are among the metrics generated by the toolset.

SLOC stands for Source Lines of Code, a unit used to measure the size of a software program based on a set of rules.[2] SLOC is a key input for estimating project effort and is also used to calculate productivity and other measurements. There are two types of SLOC:

  • Physical SLOC (PSLOC): one physical SLOC corresponds to one line, starting with its first character and ending with a carriage return or an end-of-file marker on the same line. Blank and comment lines are not counted.
  • Logical SLOC (LSLOC): lines of code intended to measure "statements", which are normally terminated by a semicolon (C/C++, Java, C#) or a carriage return (VB, Assembly). Logical SLOC are not sensitive to format and style conventions, but they are language dependent.
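The distinction between the two measures can be illustrated with a minimal sketch. It is not UCC's implementation: the function names are hypothetical, only whole-line and trailing `//` comments are recognized, and a logical statement is approximated by counting semicolons.

```python
def physical_sloc(source: str) -> int:
    """Count lines that are neither blank nor whole-line // comments."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("//"):
            count += 1
    return count


def naive_logical_sloc(source: str) -> int:
    """Approximate logical SLOC by counting semicolons outside // comments."""
    count = 0
    for line in source.splitlines():
        code = line.split("//", 1)[0]  # drop a trailing line comment, if any
        count += code.count(";")
    return count


# The same two declarations, formatted in two different ways.
compact = "int a = 1; int b = 2;\n"
spread = "// two declarations\nint a\n    = 1;\n\nint b\n    = 2;\n"

print(physical_sloc(compact), naive_logical_sloc(compact))
print(physical_sloc(spread), naive_logical_sloc(spread))
```

The physical count changes with the layout (1 line versus 4 lines), while the logical count stays at 2 statements, which is the sense in which logical SLOC is insensitive to format and style.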

The Unified CodeCount (UCC) differencing capability allows the user to count, compare, and collect logical differentials between two versions of the source code of a software product. Users can count the number of added/new, deleted, modified, and unmodified logical SLOC in the current version in comparison with the previous version.

History

Many different code counting tools existed in the early 2000s. However, due to the lack of standard counting rules and software accessibility issues, the Cost Analysis Improvement Group (NCAIG) at the National Reconnaissance Office identified the need for a new code counting tool to analyze software program costs. In order to avoid any industry bias, the CodeCount tool[3] was developed at USC CSSE under the direction of Dr. Barry Boehm, Merilee Wheaton, and A. Winsor Brown, with IV&V provided by The Aerospace Corporation. Many organizations, including Northrop Grumman and The Boeing Company, donated several code counting tools to the USC CSSE. The goal was to develop a public domain code counting tool that handles multiple languages and produces consistent results for large and small software systems.

Project plans are developed every semester, and USC graduate students doing directed research are assigned projects to update the code count tool. Vu Nguyen, a PhD student at USC, led several semesters of student projects. All changes are verified and validated by the Aerospace Corporation IV&V team, which works closely with the USC instructor on the projects. The beta versions are tested by industry affiliates and then released to the public as open source code.

In 2006, work was done to develop a differencing tool that compares two software system baselines to determine the differences between two versions of the software. The CodeCount toolset, a precursor of UCC, was released in 2007. It was a collection of standalone programs, written in a single language, that measured source code written in languages such as COBOL, Assembly, PL/1, Pascal, and Jovial.

Nguyen produced the Unified CodeCount (UCC) system design as a framework, and the existing code counters and differencing tool were merged into it. Additional features were also added, such as unified counting and differencing capabilities, duplicate file detection, and support for text and CSV output files. A presentation titled "Unified Code Count with Differencing Functionality" was given at the 24th International Forum on COCOMO in October 2009.[4]

The UCC tool has been released to the public with a license[5] that enables users to use and modify the code; if the modifications are to be distributed, the user must send a copy of the modifications to USC CSSE.

Importance

The Unified CodeCount (UCC) is used to analyze existing projects for physical and logical SLOC counts, which directly relate to work accomplished. The data collected can then be used by software cost estimation models to estimate the time and cost that similar projects will need to reach a successful conclusion. Many code counting tools are available, but most have drawbacks such as:

  • Some are proprietary, others are public domain
  • Inconsistent or unpublished counting rules
  • May not be maintained
  • Many tools have different rules for counting giving inconsistent results

NCAIG approached CSSE to create a code counting solution, developed by an unbiased, industry-respected institution, that provides the following features:

  • Count software lines of code
  • Consistently
  • With documented standards
  • Ability to easily add new languages
  • Support and maintenance
  • Compare different baselines of software
  • Determine addition, modification, deletion
  • Identify duplicate files
  • Determine complexity
  • Platform independent
  • Command line interface
  • Modes: Code counting only or counting plus differencing
  • Counts multiple files and languages in a single pass
  • Output reports
  • Robust processing
  • Options to improve performance
  • Error log

The UCC is the result of that effort, and is available as open source to the general public.

Features

The Unified CodeCount Toolset with Differencing Functionality (UCC) is a collection of tools designed to automate the collection of source code sizing and change information. The UCC runs on multiple programming languages and focuses on two possible Source Lines of Code (SLOC) definitions, physical and/or logical. The Differencing functionality can be used to compare two baselines of software systems and determine change metrics: SLOC addition, deletion, modification, and non-modification counts.

The UCC toolset is copyright USC Center for Software Engineering but is made available with a Limited Public License which allows anyone to make modifications on the code. However, if they distribute that modified code to others, the person or agency has to return a copy to USC so the toolset can be improved for the benefit of all.

Uses of CodeCount

  • Counting Capabilities- UCC allows users to measure the size information of a baseline of a source program by analyzing and producing the count for:
a) Logical SLOC
b) Physical SLOC
c) Comment
d) Executable, data declaration
e) Compiler directive SLOC
f) Keywords
  • Differencing Capabilities- UCC allows users to compare and measure the differences between two baselines of source programs. These differences are measured in terms of the number of logical SLOC added/new, deleted, modified, and unmodified. The differencing results can be saved to either plain text (.txt) or .csv files. The default is .csv, but .txt can be specified by using the -ascii switch.
  • Counting and Differencing Directories- UCC allows users to count or compare source files by specifying the directories where the files are located.
  • Support for various Programming Languages- The counting and differencing capabilities accept source code written in C/C++, C#, Java, SQL, Ada, Perl, ASP.NET, JSP, CSS, HTML, XML, JavaScript, VB, PHP, VBScript, Bash, C Shell Script, ColdFusion, Fortran, Midas, NeXtMidas, Pascal, Ruby, X-Midas, and Python.
  • Command Arguments- The tool accepts user’s settings via command arguments. UCC is a command-line application and it is compiled under the application console mode.
  • Duplication- For each baseline, two files are considered duplicates if they have the same content or if the difference between them is smaller than the threshold given through the command line switch -tdup. Two files may be identified as duplicates even though they have different filenames. Comments and blank lines are not considered during duplication processing.
  • Matching- When differencing, files from Baseline A are matched to files in Baseline B. Two files are matched if they have the same filename regardless of which directories they belong to. Remaining files are matched using a best-fit algorithm.
  • Complexity Count- UCC produces complexity counts for all source code files. The complexity counts may include the number of math, trig, and logarithm functions, calculations, conditionals, logicals, preprocessors, assignments, and pointers, as well as cyclomatic complexity. When counting, the complexity results are saved to the file "outfile_cplx.csv", and when differencing the results are saved to the files "Baseline-A-outfile_cplx.csv" and "Baseline-B-outfile_cplx.csv".
  • File Extensions- The tool determines which code counter to use for each file from the file extension.
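The duplicate detection described in the list above can be sketched as follows. This is only an illustration under stated assumptions, not UCC's algorithm: the function names are hypothetical, only whole-line `//` comments are recognized, the threshold is assumed to be a percentage (the text does not give the unit of -tdup), and the difference is measured with Python's `difflib`, which UCC does not use.

```python
import difflib


def normalize(source: str) -> list[str]:
    """Drop blank lines and whole-line // comments, and trim whitespace."""
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("//"):
            kept.append(stripped)
    return kept


def is_duplicate(a: str, b: str, threshold_pct: float = 0.0) -> bool:
    """Treat two files as duplicates if their normalized content is identical
    or differs by less than threshold_pct percent (assumed unit)."""
    lines_a, lines_b = normalize(a), normalize(b)
    if lines_a == lines_b:
        return True
    similarity = difflib.SequenceMatcher(None, lines_a, lines_b).ratio()
    difference_pct = (1.0 - similarity) * 100.0
    return difference_pct < threshold_pct


file_a = "// header\nint x = 1;\n\nint y = 2;\n"
file_b = "int x = 1;\nint y = 2;  \n// trailing note\n"
file_c = "int x = 1;\nint y = 3;\n"

print(is_duplicate(file_a, file_b))        # differ only in comments and blanks
print(is_duplicate(file_a, file_c))        # one statement differs
print(is_duplicate(file_a, file_c, 60.0))  # same pair with a loose threshold
```

Because comparison happens on the normalized lines, filenames play no part in the decision, which matches the statement that differently named files can still be duplicates.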

Functionality of CodeCount

  • Execution speed:
CodeCount is written in C/C++ and uses relatively simple algorithms to recognize comments and physical/logical lines. Testing has shown that the UCC processes files acceptably fast except in extreme situations. A number of switches are available to inhibit certain types of processing if needed. Users may be able to compile using optimization switches for faster execution; refer to the user's manual for the compiler being used.
  • Reliability and Correctness
CodeCount has been tested extensively in the laboratory, and is being used globally. There is a defect-reporting capability, and any defects reported are corrected promptly. It is not uncommon for users to add functionality or correct defects and notify the UCC managers along with providing the code for the changes.
  • Documentation
The UCC open source distribution contains Release Notes, User’s Manual, and Code Counting Standards for the language counters. The source code contains file headers and in-line comments. The UCC Software Development Plan, Software Requirements Specification, and Software Test Plan are available upon request.
  • Ease of general maintenance
The UCC is a monolithic, object-oriented toolset designed to simplify its maintenance.
  • Ease of extension
The "CSCI" CodeCount flavor lends itself to ease of extension. Users are able to easily add another language counter on their own. Users may also specify which file extensions will select a particular language counter.
  • Compatibility
CodeCount is designed to be compatible with the COCOMO estimation mechanism where this is required or desired.
  • Portability
CodeCount has been tested on a wide variety of operating systems and hardware platforms and should be portable to any environment that has an ANSI standard C++ compiler.
  • Availability of source code
Source code for CodeCount is available as a downloadable zip file.
  • Licensing
Source code for CodeCount is provided under the terms of the USC-CSE Limited Public License, which allows anyone to make modifications on the code. However, if they distribute that modified code to others, the person or agency has to return a copy to USC so the toolset can be improved for the benefit of all. The full text of the license can be viewed at UCC License.

Standards for the Language

The main objective of the Unified CodeCount (UCC) is to provide counting methods that define a consistent and repeatable SLOC measurement. There are more than 20 SLOC counting applications, each of which produces different physical and logical SLOC counts, and some 75 commercially available software cost estimating tools on today's market. The differences in cost results from the various tools show the deficiencies of current techniques for estimating code size, particularly for large projects,[6] where cost estimation depends on automatic procedures to generate reasonably accurate predictions. This led to the need for a universal SLOC counting standard that would produce consistent results.

SLOC serves as a main factor in cost estimation techniques. Although it is not the sole contributor to software cost estimation, it provides the foundation for a number of metrics that are derived throughout the software development life cycle. The SLOC counting procedure can be automated, requiring less time and effort to produce metrics. A well-defined set of rules identifies what to include in and exclude from SLOC counting measures. The two most accepted measures for SLOC are the number of physical and logical lines of code.

In the UCC, logical SLOC measures the total number of source statements in a block of code. The three types of statements are executable statements, declarations, and compiler directives. Executable statements are eventually translated into machine code to cause run-time actions, while declaration and compiler directive statements affect the compiler's actions.

The UCC treats source statements as independent units at the source code level, where a programmer constructs a statement and its sub-statements completely. The UCC assumes that the source code will compile; otherwise the results are unreliable. A big challenge was deciding where each statement ends when counting logical SLOC. Counting semicolons may sound appealing, but not all popular languages use the semicolon as a terminator (for example SQL, JavaScript, and UNIX scripting languages). The Software Engineering Institute (SEI) at Carnegie Mellon University and COCOMO II SLOC defined a way to count "how many of what program elements". Tables 1 and 2 summarize the SLOC counting rules[7]: Table 1 for physical lines of code, and Table 2 for logical lines of code in the C/C++, Java, and C# programming languages. The UCC Code Counting Rules for each language are distributed with the open source release.

Measurement Unit | Order of Precedence | Physical SLOC
Executable lines | |
  Statements | 1 | One per line
Non-executable lines | |
  Declaration (Data) lines | 2 | One per line
  Compiler directives | 3 | One per line
Table 1. Physical SLOC Counting Rules


Structure | Order of Precedence | Logical SLOC
SELECTION STATEMENTS: if, else if, else, "?" operator, try, catch, switch | 1 | Count once per each occurrence. Nested statements are counted in the similar fashion.
ITERATION STATEMENTS: for, while, do..while | 2 | Count once per each occurrence. Initialization, condition and increment within the "for" construct are not counted, i.e. for (i = 0; i < 5; i++)… In addition, any optional expressions within the "for" construct are not counted either, e.g. for (i = 0, j = 5; i < 5, j > 0; i++, j--)… Braces {…} enclosed in iteration statements and the semicolon that follows "while" in a "do..while" structure are not counted.
JUMP STATEMENTS: return, break, goto, exit, continue, throw | 3 | Count once per each occurrence. Labels used with "goto" statements are not counted.
EXPRESSION STATEMENTS: function call, assignment, empty statement | 4 | Count once per each occurrence. Empty statements do not affect the logic of the program, and usually serve as placeholders or to consume CPU for timing purposes.
STATEMENTS IN GENERAL: statements ending by a semicolon | 5 | Count once per each occurrence. Semicolons within a "for" statement, or as stated in the comment section for the "do..while" statement, are not counted.
BLOCK DELIMITERS, BRACES | 6 | Count once per pair of braces {..}, except where a closing brace is followed by a semicolon, i.e. };. Braces used with selection and iteration statements are not counted. Function definition is counted once since it is followed by a set of braces.
COMPILER DIRECTIVE | 7 | Count once per each occurrence.
DATA DECLARATION | 8 | Count once per each occurrence. Includes function prototypes, variable declarations, typedef statements. Keywords like struct, class do not count.
Table 2. Logical SLOC Counting Rules for C/C++, Java, and C#
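The iteration rule in Table 2 shows why counting semicolons alone overstates logical SLOC. The following sketch applies just two of the rules and is not the UCC counter: each `if`, `switch`, `for`, or `while` counts once, and semicolons count only outside parentheses, so the ones inside a "for" header are skipped. The function name is hypothetical, and the sketch assumes input without comments, string literals, `else` branches, or `do..while` loops.

```python
import re

# Keywords that count once per occurrence in this simplified sketch.
KEYWORDS = re.compile(r"\b(if|switch|for|while)\b")


def logical_sloc(code: str) -> int:
    """Count keyword occurrences plus semicolons found outside parentheses."""
    count = len(KEYWORDS.findall(code))
    depth = 0
    for ch in code:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == ";" and depth == 0:
            count += 1  # statement terminator, not part of a "for" header
    return count


simple_loop = "for (i = 0; i < 5; i++) total += i;"
double_loop = "for (i = 0, j = 5; i < 5, j > 0; i++, j--) { total += i; count++; }"

print(simple_loop.count(";"), logical_sloc(simple_loop))
print(double_loop.count(";"), logical_sloc(double_loop))
```

For the first loop a plain semicolon count gives 3, while the rule-based count gives 2: one for the "for" statement and one for the assignment in its body.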

Software design

The Unified CodeCount (UCC) performs counting by capturing the LSLOC strings from a file according to a counting specification document created for each language; this specification is proposed as a standard. The differencing feature uses a common engine to compare the LSLOC strings captured from the two files during the counting process.

UCC Architecture

The main architecture of UCC can be seen as a hierarchical structure of the following components:

Primary Classes of UCC.

1. MainObject

The MainObject is the top-level class. It parses the command line, extracts the list of files from the command parameters, and then reads each file into memory for counting or differencing. The MainObject calls the CodeCounters in order to process the embedded languages. The counting function produces the following set of files (.txt) for duplicate and counting/complexity results:

<LANG>_outfile.txt: the file where Main displays the counting results for source files of <LANG>. <LANG> is the name of the language of the source files, e.g., C_CPP for C/C++ files and Java for Java files.
outfile_cplx.txt: the complexity results for the source files.
Duplicates-<LANG>_outfile.txt: the list of duplicate files for the language <LANG>.
Duplicates-outfile_cplx.txt: the complexity results for the duplicated files.
DuplicatePairs.txt: a text file listing matches between a source file and its duplicate file.

2. DiffTool

DiffTool is derived from MainObject. It parses the command line parameters and processes the list of files for each baseline. The DiffTool class produces the following set of files (.txt, .csv) across baselines:

Baseline-<A|B>-<LANG>_outfile.txt: count results for source files of <LANG> for Baseline A and Baseline B.
Baseline-<A|B>-<LANG>_cplx.txt: complexity results for Baseline A and Baseline B.
MatchedPairs: a text file listing matches between files in Baseline A and Baseline B.
outfile_diff_results.txt: main differencing results in plain text format.
outfile_diff_results.csv: main differencing results in .csv format that can be opened using MS Excel.

DiffTool performs the comparison between baselines, with the help of ‘CmpMngr’ class.

3. CmpMngr

CmpMngr calculates the differences by comparing two lists of LSLOC and determines the variations by calculating total LSLOC added, deleted, modified, unmodified from the two lists.
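The kind of comparison CmpMngr performs can be sketched with Python's standard `difflib`. This illustrates the four categories only and is not the UCC algorithm: the function name is hypothetical, and the rule that pairs lines inside a replaced region as "modified" is an assumption, since the text does not say how UCC decides that a line was modified rather than deleted and re-added.

```python
import difflib


def diff_counts(old: list[str], new: list[str]) -> dict[str, int]:
    """Classify LSLOC strings as added, deleted, modified, or unmodified."""
    counts = {"added": 0, "deleted": 0, "modified": 0, "unmodified": 0}
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_n, new_n = i2 - i1, j2 - j1
        if tag == "equal":
            counts["unmodified"] += old_n
        elif tag == "delete":
            counts["deleted"] += old_n
        elif tag == "insert":
            counts["added"] += new_n
        else:
            # "replace": pair lines one to one as modified (assumed rule);
            # any surplus on either side is a pure deletion or addition.
            paired = min(old_n, new_n)
            counts["modified"] += paired
            counts["deleted"] += old_n - paired
            counts["added"] += new_n - paired
    return counts


baseline_a = ["int a = 1;", "int b = 2;", "return a;"]
baseline_b = ["int a = 1;", "int b = 3;", "int c = 4;", "return a;"]

print(diff_counts(baseline_a, baseline_b))
```

For these two lists the sketch reports two unmodified statements, one modified statement, and one added statement.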

4. CCodeCounter

The CCodeCounter is used for pre-count processing, where it performs the following operations:

• Counts the blank lines and comments.
• Filters the literal strings.
• Counts the complexity of keywords, operators, etc.
• Counts the compiler directive SLOC (using the CountDirectiveSLOC method).
• Performs the language-specific processing (creates subclasses).
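One reason to filter literal strings before counting is that a string may contain characters, such as `//` or `;`, that would otherwise be mistaken for comments or statement terminators. The sketch below illustrates that idea only and is not the CCodeCounter class: the names are hypothetical, only double-quoted literals and whole-line `//` comments are handled, and block comments are ignored.

```python
import re

# A double-quoted literal, allowing backslash-escaped characters inside it.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')


def blank_strings(line: str) -> str:
    """Replace the contents of each string literal with an empty literal."""
    return STRING_LITERAL.sub('""', line)


def precount(source: str) -> dict[str, int]:
    """Classify each line as blank, comment, or code after string filtering."""
    stats = {"blank": 0, "comment": 0, "code": 0}
    for line in source.splitlines():
        filtered = blank_strings(line).strip()
        if not filtered:
            stats["blank"] += 1
        elif filtered.startswith("//"):
            stats["comment"] += 1
        else:
            stats["code"] += 1
    return stats


sample = '// setup\n\nputs("http://example.com; ok");\n'

print(blank_strings('puts("a; // b");'))
print(precount(sample))
```

After filtering, the statement `puts("a; // b");` contains a single semicolon and no comment marker, so later counting stages see only the real statement terminator.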

Future enhancements and release

Future plans for UCC include improving the computation of complexity metrics, supporting existing code counters and adding new counters for additional languages, improving reporting, and improving performance. Counters for text, assembly, COBOL, Jovial, MATLAB, and Pascal are in development. A graphical user interface is also being produced, which may be used in place of the current command line interface.

System Requirements

A. Hardware

  • RAM: minimum 512 MB. Recommended: 1024 MB
  • HDD: minimum 100 MB disk space available. Recommended: 200 MB

B. Software Operating Systems

  • Linux 2.6.9
  • Unix
  • Mac OS X
  • Windows 9x/Me/XP/Vista
  • Solaris

C. Compilers Supported

  • ANSI C/C++ Compiler
References

  1. B. Boehm; C. Abts; S. Chulani. "Software development cost estimation approaches: A survey". Annals of Software Engineering, 2000.; B. Boehm; E. Horowitz; R. Madachy; D. Reifer; B. K. Clark; B. Steece; A. W. Brown; S. Chulani & C. Abts. "Software Cost Estimation with COCOMO II".
  2. Software Engineering Institute. "Software Size Measurement: A Framework for Counting Source Statements" (PDF). Technical Report CMU/SEI-92-TR-20 ESC-TR-92-020, 1992.
  3. "CodeCount, USC's Center for Systems and Software Engineering". Csse.usc.edu.
  4. "CSSE - Home". Csse.usc.edu. Retrieved 28 December 2018.
  5. "Archived copy". Archived from the original on 2011-03-06. Retrieved 2010-11-30.
  6. G. E. Kalb. "Counting Lines of Code, Confusions, Conclusions, and Recommendations" (PDF). Briefing to the 3rd Annual REVIC User’s Group Conference, 1990.
  7. "A SLOC Counting Standard" (PDF). Sunset.usc.edu. Retrieved 28 December 2018.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.