Google Summer of Code 2019 Final Report: the process of integrating Software Heritage.

GSoC Final Report: Integration of Software Heritage in FOSSology

FOSSology is an open-source license compliance software system and toolkit. As a toolkit, it lets a user run license, copyright and export control scans from the command line. Users' files can be imported, stored and tracked in FOSSology as part of a compliance workflow. The license, copyright and export control scanners are the tools FOSSology provides to help with the user's compliance work.

The FOSSology system is a combination of agents that run in series, each performing a specific task. FOSSology has several agents, such as unpacking, license analysis and copyright. Agents are used to perform analysis or management tasks related to anything in the database. Each agent in FOSSology performs one task.

In this project, I worked on an agent that integrates Software Heritage into FOSSology. The work is divided into four stages. The stages depend on each other, and the overall flow is shown in the diagram below.

[Image: Basic workflow]

Basic Details

As shown in the picture above, I went through four stages to complete the project:

  • Calculate the hash values
  • Database Schema creation for Software Heritage
  • Agent creation and Data storing
  • Display the fetched records

Calculate the hash values

I worked on this section during my first evaluation. It deals with calculating the sha256 of a file. Previously, when a package was uploaded to FOSSology, the md5 and sha1 values were calculated for the files of the package and inserted into the pfile table by the ununpack agent (which runs while a package is being uploaded). My job was to alter the pfile table to add a pfile_sha256 column, calculate the sha256 for each file, and store it in the new column of the same table. The development involved code changes and a database schema alteration, along with migration commands to run on the existing data in databases. The structure of the new pfile_sha256 column is:

|column           | Type          |
|pfile_sha256     | character(64) |

The code for calculating the hash values can be found in the utils.c file of the ununpack agent. The process is similar to the code shown below:

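    /* Build the shell command; the 13 extra bytes cover "sha256sum ''" plus the NUL */
    snprintf(command, PATH_MAX + 13, "sha256sum '%s'", CI->Source);
    FILE* file = popen(command, "r");
    if (file != (FILE*) NULL)
    {
      /* The first token printed by sha256sum is the 64-character hex digest */
      read = fscanf(file, "%64s", SHA256);
      retcode = WEXITSTATUS(pclose(file));
    }
    /* Abort if the command could not be started, failed, or printed no digest */
    if (file == (FILE*) NULL || retcode != 0 || read != 1)
    {
      LOG_FATAL("Unable to calculate SHA256 of %s\n", CI->Source);
      SafeExit(56);
    }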

The SHA256 value is required to look up a file's record in Software Heritage.
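
For readers who want to reproduce the digest outside the agent, here is a minimal Python sketch of the same calculation. It is not the agent's code (the agent shells out to sha256sum as shown above); the function names `sha256_of_file` and `demo_digest` are made up for this illustration. The result is a 64-character hex string, which is why the pfile_sha256 column is declared as character(64).

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 of a file as 64 lowercase hex characters."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo_digest():
    """Hash a small temporary file containing b'abc' and return the digest."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"abc")
        path = tmp.name
    try:
        return sha256_of_file(path)
    finally:
        os.remove(path)
```

Calling `demo_digest()` hashes a three-byte file; any real file path can be passed to `sha256_of_file` in the same way.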

Commits

  • #d364193 feat(db): Calculate the sha256 value of the uploading file and store it in database

Database Schema Creation for Software Heritage

I worked on this feature during my second evaluation. It deals with creating the schema that stores Software Heritage data. The data fetched from Software Heritage is saved in a table named software_heritage. My mentors and I decided to store two kinds of data (origin and license) from the Software Heritage archive, along with two more columns (the primary key and pfile_fk). The motive is to relate each pfile record to the software_heritage table. The table structure is:

|Column               | Type    | Modifiers                                                      |
|software_heritage_pk | integer | not null default nextval('software_heritage_pk_seq'::regclass) |
|pfile_fk             | integer | not null                                                       |
|license              | text    |                                                                |
|origin               | text    |                                                                |

Foreign-key constraints:
    "software_heritage_pfile_fk_fkey" FOREIGN KEY (pfile_fk) REFERENCES pfile(pfile_pk) ON DELETE CASCADE

As you can see, software_heritage_pk is the primary key, pfile_fk is the foreign key, and the other two columns, license and origin, hold the records from Software Heritage.
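
The effect of the ON DELETE CASCADE constraint can be tried out with a small Python sketch. It uses an in-memory SQLite database as a stand-in for FOSSology's PostgreSQL database, which is an assumption made only to keep the example self-contained. The table and column names follow the structure above, while the helper names and sample values are invented. The sketch stores one row per license, which is how I read the insert loop in the agent.

```python
import sqlite3


def build_demo_db():
    """Create in-memory stand-ins for pfile and software_heritage."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE pfile ("
        " pfile_pk INTEGER PRIMARY KEY,"
        " pfile_sha256 CHAR(64))"
    )
    conn.execute(
        "CREATE TABLE software_heritage ("
        " software_heritage_pk INTEGER PRIMARY KEY,"
        " pfile_fk INTEGER NOT NULL"
        "   REFERENCES pfile(pfile_pk) ON DELETE CASCADE,"
        " license TEXT,"
        " origin TEXT)"
    )
    return conn


def count_sh_rows(conn):
    """Number of rows currently stored in software_heritage."""
    return conn.execute("SELECT COUNT(*) FROM software_heritage").fetchone()[0]


conn = build_demo_db()
conn.execute("INSERT INTO pfile (pfile_pk, pfile_sha256) VALUES (1, ?)", ("0" * 64,))
conn.execute("INSERT INTO software_heritage (pfile_fk, license) VALUES (1, 'MIT')")
conn.execute("INSERT INTO software_heritage (pfile_fk, license) VALUES (1, 'GPL-2.0')")
rows_before = count_sh_rows(conn)
# Deleting the pfile row removes its software_heritage rows as well.
conn.execute("DELETE FROM pfile WHERE pfile_pk = 1")
rows_after = count_sh_rows(conn)
```

Because of the cascade, removing a file from pfile never leaves orphaned Software Heritage records behind.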

Creating Software Heritage Agent

This is the most exciting part and the heart of my project. I worked on this section during my second evaluation of GSoC. It covers calling the API and everything else the agent performs, split into the following parts:

  • Redundancy check feature
  • API calling feature
  • Storing the value in Software Heritage Table
  • Inserting the License Info in license_file table.
  • Registering the agent.
  • Basic Visualization of the data

Let's discuss each of these briefly:

Redundancy Check

If Software Heritage returns an HTTP 404 response for a file, no record is inserted into the software_heritage table for it. The application is built so that a user can run the softwareHeritage agent as many times as they want, but the agent only processes those files of a package that have no records in the software_heritage table yet. As a result, no file is looked up again once its record is stored, and no duplicate records are created. I added two pieces to make this happen: one in SoftwareHeritageDao and another in the softwareHeritage agent. In SoftwareHeritageDao, getSoftwareHeritagePfileFk takes an uploadId as a parameter and returns the pfile ids of the files of a package whose records are already in the software_heritage table. In the softwareHeritage agent, we call the API only for the files whose records are not there yet.

SoftwareHeritageDao

    /**
     * @brief Get all the pfile_fk stored in software heritage table
     * @param Integer $uploadId
     * @return array
     */
    public function getSoftwareHeritagePfileFk($uploadId)
    {
        $uploadTreeTableName = $this->uploadDao->getUploadtreeTableName($uploadId);
        $stmt = __METHOD__.$uploadTreeTableName;
        $sql = "SELECT software_heritage.pfile_fk AS pfile_fk
                FROM $uploadTreeTableName
                JOIN software_heritage
                ON $uploadTreeTableName.upload_fk = $1
                AND software_heritage.pfile_fk = $uploadTreeTableName.pfile_fk";
        $rows = $this->dbManager->getRows($sql,array($uploadId),$stmt);
        $results = [];
        foreach ($rows as $row) {
            $results[] = $row['pfile_fk'];
        }
        return $results;
    }

    softwareHeritageAgent

        /*codes*/
        // Getting the pfile FKs
        $pfileFks = $this->shDao->getSoftwareHeritagePfileFk($uploadId);
        /*codes*/
        foreach(/*codes*/)
        {
            // Only process files that have no record in software_heritage yet
            if(!in_array($pfileDetail['pfile_pk'],$pfileFks))
            {
               /*codes*/
            }
            $this->heartbeat(1);
        }
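
The filtering step of the redundancy check can be sketched in isolation as follows. This is an illustrative Python version, not the agent's PHP code: `files_to_query` is a made-up name and the pfile ids are sample values.

```python
def files_to_query(upload_pfile_ids, already_stored_ids):
    """Return the pfile ids that still need a Software Heritage lookup.

    upload_pfile_ids: pfile ids of every file in the upload, in scan order.
    already_stored_ids: ids already present in software_heritage, as a
        lookup such as getSoftwareHeritagePfileFk would return them.
    """
    stored = set(already_stored_ids)  # a set gives fast membership checks
    return [pfile_id for pfile_id in upload_pfile_ids if pfile_id not in stored]


# First run: nothing is stored yet, so every file is queried.
first_run = files_to_query([11, 12, 13, 14], [])
# Later run: files 11 and 13 already have records and are skipped.
second_run = files_to_query([11, 12, 13, 14], [11, 13])
```

One consequence of this design is that files for which the archive returned 404 have no row in the table, so they are queried again on the next run.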

API Calling and Storing Data

The next step is to call the API for the files and get the values from Software Heritage. The API location is stored in the agent/softwareHeritage.conf file.

    api[url] = "https://archive.softwareheritage.org"
    api[uri] = "/api/1/content/sha256:"
    api[content] = "/license"
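
As a rough illustration of how these three settings combine into a request URL, consider the Python sketch below. The dictionary mirrors the configuration values above; the function name `license_endpoint` and the sample hash are invented for the example.

```python
API_CONFIG = {
    "url": "https://archive.softwareheritage.org",
    "uri": "/api/1/content/sha256:",
    "content": "/license",
}


def license_endpoint(sha256, config=API_CONFIG):
    """Concatenate the three configured parts around the file's SHA256."""
    return config["url"] + config["uri"] + sha256 + config["content"]


# "ab" * 32 is a dummy 64-character value, not the hash of a real file.
example_url = license_endpoint("ab" * 32)
```

Keeping the base URL and the two path fragments in the configuration file means the endpoint can change without touching the agent code.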

We use Guzzle (GuzzleHttp) to call the API. The API is called for each file of the package and the result is returned for further processing. If the API returns an HTTP 404 response, an empty license array is returned and no values are stored in the database. Otherwise, the licenses are stored in the software_heritage table, and the matching license_file records are stored in the database too.

API Calling and Data Storing

    /**
     * @brief Get the license details from software heritage
     * @param String $sha256
     *
     * @return array
     */
    protected function getSoftwareHeritageLicense($sha256)
    {
        // 'http_errors' => false makes Guzzle return 4xx/5xx responses instead of throwing
        $client = new Client(['http_errors' => false]);
        $api = $this->configuration['api'];
        $response = $client->get($api['url'].$api['uri'].$sha256.$api['content']);
        $statusCode = $response->getStatusCode();
        if(200 === $statusCode)
        {
            $responseContent = json_decode($response->getBody()->getContents(),true);
            $licenseRecord = $responseContent["facts"][0]["licenses"];

            return $licenseRecord;
        }
        else
        {
            // Any non-200 response (for example 404 for unknown content) yields no licenses
            return [];
        }
    }
    
    /**
     * @brief Insert the License Details in softwareHeritage table
     * @param int $pfileId
     * @param array $licenses
     * @param int $agentId
     * @return boolean True if finished
     */
    protected function insertSoftwareHeritageRecord($pfileId,$licenses,$agentId)
    {
        foreach($licenses as $license)
        {
            $this->shDao->setshDetails($pfileId, $license);
            $l = $this->licenseDao->getLicenseByShortName($license);
            if($l != NULL)
            {
                $this->dbManeger->insertTableRow('license_file',['agent_fk' => $agentId,'pfile_fk' => $pfileId,'rf_fk'=> $l->getId()]);
            }
        }
        return true;
    }
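
The decision logic of the two functions above can be summarised in a small Python sketch that needs no network or database access. It is an approximation under stated assumptions: the `facts`/`licenses` response shape is taken from the PHP code, `known_licenses` is a hypothetical stand-in for the lookup done by getLicenseByShortName, and the guard for an empty `facts` list is an extra that the agent code shown above does not have.

```python
import json


def licenses_from_response(status_code, body):
    """Mirror the agent's rule: only a 200 response yields licenses."""
    if status_code != 200:
        return []
    payload = json.loads(body)
    facts = payload.get("facts") or []
    # Extra guard (not in the agent code shown above): tolerate empty "facts".
    if not facts:
        return []
    return facts[0].get("licenses", [])


def plan_inserts(pfile_id, licenses, known_licenses, agent_id):
    """Return the rows that would go into each table for one file.

    known_licenses maps a license short name to its id (hypothetical
    stand-in for the license lookup done through licenseDao).
    """
    sh_rows, license_file_rows = [], []
    for name in licenses:
        # Every license reported by the archive gets a software_heritage row.
        sh_rows.append({"pfile_fk": pfile_id, "license": name})
        # Only licenses known to FOSSology also get a license_file row.
        if name in known_licenses:
            license_file_rows.append(
                {"agent_fk": agent_id, "pfile_fk": pfile_id, "rf_fk": known_licenses[name]}
            )
    return sh_rows, license_file_rows


body = json.dumps({"facts": [{"licenses": ["MIT", "Unknown-License"]}]})
found = licenses_from_response(200, body)
missing = licenses_from_response(404, "")
sh_rows, lf_rows = plan_inserts(7, found, {"MIT": 42}, agent_id=3)
```

In this example both licenses are recorded for the file, but only MIT, which the stand-in license table knows about, produces a license_file row.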

Registering the agent and Basic View

The agent is registered using the agent plugin mechanism, as was done for the previous agents. The registration happens through agent-shagent.php and is reflected in the user interface. The basic view is shown on the license listing pages and the file-browser pages.

Commits

  • #63cfa7 feat(software-heritage): Create a software heritage agent and add the functionality
  • #b807e4 feat(db): Make table of software heritage to store information
  • #38f51a feat(software-heritage): Make the ui section of software heritage and register the agent
  • #b9c1fc feat(software-heritage): Make softwareHeritage dao function and add all the functionality related software_heritage table to it
  • #da4806 feat(softwareHeritageView): Show the details of software heritage in the license list page

Displaying the Records

I worked on this section during my third evaluation. We decided to have a separate view for the data under a software_heritage section. The basic idea was to display the result in a tabular view, as is done in the file-browser section. The development of this section included two steps:

  • creating the backend file structure
  • creating the frontend file structure

The backend consists of two files (softwareHeritage-plugin.php and AjaxSHDetailsBrowser.php). softwareHeritagePlugin is the basic request handler, which takes care of basic frontend details such as registering the menu and getting the total number of records in a package. AjaxSHDetailsBrowser is an API which returns the file tree view along with the hash value and license details of each file. On the frontend, softwareHeritage.html.twig displays the records, whereas softwareHeritage.js.twig calls the API and fills the table contents with the help of datatable.js.
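
To give an idea of the kind of data such a view needs, the Python sketch below turns file records and their licenses into one table row per file. It is purely illustrative: the function name, the dictionary keys and the row layout are assumptions, and the real AjaxSHDetailsBrowser response format may differ.

```python
def build_table_rows(files, licenses_by_pfile):
    """Return one [file name, sha256, licenses] row per file.

    files: dicts with the hypothetical keys "pfile_pk", "name" and "sha256".
    licenses_by_pfile: maps a pfile id to the license names stored for it.
    """
    rows = []
    for entry in files:
        names = licenses_by_pfile.get(entry["pfile_pk"], [])
        # Files without any stored license get an empty license cell.
        rows.append([entry["name"], entry["sha256"], ", ".join(sorted(names))])
    return rows


demo_rows = build_table_rows(
    [
        {"pfile_pk": 1, "name": "main.c", "sha256": "a" * 64},
        {"pfile_pk": 2, "name": "README", "sha256": "b" * 64},
    ],
    {1: ["MIT", "BSD-3-Clause"]},
)
```

Each inner list corresponds to one line of the table shown in the Software Heritage view.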

Commits

  • #1fca099 feat(softwareHeritage): Make the view for softwareHeritage records for a package

Views

  • Add Software Heritage Agent (screenshot)
  • Software Heritage Table (screenshot)
  • License List Page (screenshot)
  • File Info Page (screenshot)

Additional Contribution

Apart from the main deliverables above, I also contributed a few other patches.

  • #e304e4 fix(vscode): Add vscode editor file to gitignore
  • #e514dc feat(ui):Add user description of available user in group management page

Conclusion

What tasks were accomplished

|Task                                                        | Planned | Completed |
|Calculate the SHA256 value of the files of a package        | yes     | yes       |
|Make the migration file to insert sha256 for previous files | yes     | yes       |
|Create the software heritage agent                          | yes     | yes       |
|Create a database table for software heritage details       | yes     | yes       |
|Register the agent                                          | yes     | yes       |
|Run the Agent                                               | yes     | yes       |
|Redundancy check of the details                             | yes     | yes       |
|Store the details in software heritage table                | yes     | yes       |
|Store the license details in license_file table             | no      | yes       |
|Display the details fetched from software heritage          | yes     | yes       |

Future Work

Currently, we are not getting the origin value from the Software Heritage archive. Once the Software Heritage archive makes that data public, that part needs to be added. The work involves adding the API call and the same functionality for the origin value.

What did I learn from Google Summer of Code

During my time in Google Summer of Code, I learned many things, such as:

  • Understanding the code base of FOSSology.
  • Exploring new features of PHP and C while working on the various features.
  • Creating an agent and working with databases in the FOSSology application.
  • Handling user interface features in FOSSology.
  • Improving my approach to solving a problem.
  • Sharpening my knowledge of debugging and the error correction process.