Exploring SimFin API Using Exploratory Data Analysis

Analysis, in general, is the process of exploring a topic in particular by breaking and simplifying it into smaller sub-topics to gain insights and a better understanding. The process of evaluating a business and its associated resources such as projects, budgets, and other finance-related information is known as Financial analysis. Financial Analysis performed on a particular business helps us determine its performance and suitability for others. It also provides us with a better understanding of whether the entity, in particular, is profitable enough to provide a return on investment to its investors, is the entity stable and solvent or not. In other words, it is the process of determining the financial strengths and weaknesses of the entity by establishing a strategic relationship between the items of the entity, such as balance sheets, profit and loss account, and other financial statements.

Financial analysis is used to evaluate economic trends, set financial policies, help build long-term plans for business activities, and identify projects or companies for investment. This is done through the crunching of financial numbers and data. Financial analysis is generally conducted in both corporate finance and investment finance settings. There are two ways of conducting a financial analysis: Internal Analysis and External Analysis. When conducted Internally, the analysis can help managers and business heads make key decisions from the generated insights or review their historical trends to analyse and review what has worked for them in the past. On the other hand, Analysis, when conducted Externally, helps potential investors explore and choose the best investment opportunities or policies. 

When conducted, such analytical research also helps the top management determine the success or failure of the company’s operation, appreciating and appraising an individual’s performance and contribution to the organization, realising whether the firm’s resources are being used efficiently and evaluating the internal control system. Investors who have invested their money into the firm’s shares are always keen and interested in its performance earnings and future profitability. Financial analysis helps them predict and interpret the firm’s bankruptcy and failure probability of its business enterprises. After being aware of a probable failure, investors can take preventive measures that can help them to avoid and minimize financial losses.

Investing in stocks requires a careful analysis of the financial data to determine a company’s true worth. This can be time-consuming and somewhat cumbersome. The easier way to find out about a company’s performance is to look at its financial ratios, most of which are freely available on the internet. Data Analysts and Market Researchers can use one such library to explore financial data, known as Simfin.

What is Simfin?

Simfin is an open world crowdsourcing platform to explore and research fundamental financial data. All the present data is free, easily accessible, readily available and user-generated.  Simfin aims to make financial data freely available to private investors, researchers, and students. The data collection process is self-automated and combined along with machine learning technologies. With such available data services, SimFin aims to help those investors and researchers who cannot afford to have huge investments per year on data. SimFin tries to also maintain a check on their data by manually updating and expanding its datasets. SimFin makes it very easy to obtain and use financial and stock market data in Python. The easiest way of working with the entire datasets is to use the Python API for SimFin; it automatically downloads the share-prices through its in technology, known as signal and fundamental data from the SimFin server is fetched in no time, also saves the data to disk for future use, and loads the data into the Pandas DataFrames easily. It features data for more than 2000+ companies and standardised statements, and a stock screener with 70+ company ratios and shareable.

Using Simfin for Exploratory Data Analysis

We’ll be using Simfin’s Python API and exploring one of the many datasets that it contains. Our goal will be to use SimFin’s Stock Screener for NetNet stocks, better known as Net Net Working Capital stocks which focus on current assets, taking cash and cash equivalents at full value, then reducing accounts receivable for doubtful accounts, and reducing inventories to liquidation values. These are essential stocks where the current assets have been further discounted before subtracting all liabilities, preferred shares, and off-balance sheet liabilities. We will work on a defined hypothesis that a given valuation ratio or another financial signal can be used to predict future stock returns and find how we can know a relation between some financial signals and future stock returns?

The following is an official implementation recreated from SimFin’s Tutorial. You can find the link to their Github page here.

Getting Started with EDA

To get started we’ll first install the SimFin Library; you can use the following code. 

 !pip install simfin

Next we will install all the dependencies and also import the SimFin Python API,

 %matplotlib inline
 import pandas as pd
 import seaborn as sns
 import statsmodels.api as sm
 # Import the main functionality from the SimFin Python API.
 import simfin as sf
 # Import names used for easy access to SimFin's data-columns.
 from simfin.names import *
 Check SimFin Version, 
 # Version of the SimFin Python API.
Configuring the data director path and API path 
 # SimFin data-directory.

Setting key to use free data.

 # SimFin load API key or use free data.
 sf.load_api_key(path='~/simfin_api_key.txt', default_key='free')
 # Seaborn set plotting style.
Accessing the Data Hub

We will now use a StockHub object to load easily and process the US stock market financial data with these settings:

 # We are interested in the US stock-market.
 market = 'us'
 # Add this date-offset to the fundamental data such as Income Statements 
 # can be set to 1, 2 or even 3 months after the Report Date.
 offset = pd.DateOffset(days=60)
 # Refresh the fundamental datasets (Income Statements etc.) every 30 days.
 refresh_days = 30
 # Refresh the dataset with share prices every 10 days.
 refresh_days_shareprices = 10
 Accessing StockHub, 
 hub = sf.StockHub(market=market, offset=offset,
Using Data Signals 

First, we calculate financial signals for all the stocks, such as the Current Ratio, Debt Ratio, Net Profit Margin, Return on Assets, etc. These are calculated using and accessing data from the financial reports: Income Statements, Balance Sheets and Cash-Flow Statements, which are automatically downloaded and loaded by the data hub.

Defining Signal to be Fed and extract data, 

df_fin_signals = hub.fin_signals(variant='daily')

Now we will calculate growth signals for the stocks, such as Earnings Growth, FCF Growth, Sales Growth, etc. These are also calculated using data from the financial reports: Income Statements, Balance Sheets and Cash-Flow Statements, which are automatically downloaded and loaded by the data hub.

 Fetching growth signals, 
 df_growth_signals = hub.growth_signals(variant='daily') 

Now we will calculate valuation signals for the stocks, such as P/E, P/Sales, etc. Again, these are calculated from the share prices and data from the financial reports. Because the data-hub has already loaded the required datasets in the function-calls above, the data is merely reused here, and the data-hub can proceed directly to computing the signals.

Fetching valuation signals,

 df_val_signals = hub.val_signals(variant='daily')
 Combining all the signals into a single DataFrame:
 # Combine the DataFrames.
 dfs = [df_fin_signals, df_growth_signals, df_val_signals]
 df_signals = pd.concat(dfs, axis=1)
 Show the resultant dataframe :
Evaluating & Handling the Missing Data

Some of the data signals have a lot of missing data, which can cause problems in the statistical analysis. Let us first see the fraction of each signal column that is missing:

 # Remove all rows with only NaN values.
 df = df_signals.dropna(how='all').reset_index(drop=True)
 # For each column, show the fraction of the rows that are NaN.
 (df.isnull().sum() / len(df)).sort_values(ascending=False) 

We will get the following output :

 R&D / Revenue                        0.616803
 R&D / Gross Profit                   0.616652
 Return on Research Capital           0.616652
 Dividend Yield                       0.430889
 Inventory Turnover                   0.287232
 Debt Ratio                           0.259329
 Net Acquisitions / Total Assets      0.256498
 Sales Growth                         0.141619
 Sales Growth YOY                     0.136864
 FCF Growth                           0.133579
 Assets Growth                        0.133579
 Earnings Growth                      0.133579
 FCF Growth YOY                       0.127126
 Earnings Growth YOY                  0.127109
 Assets Growth YOY                    0.127109
 Interest Coverage                    0.100014
 Sales Growth QOQ                     0.046013
 CapEx / (Depr + Amor)                0.038795
 FCF Growth QOQ                       0.033883
 Earnings Growth QOQ                  0.033883
 Assets Growth QOQ                    0.033883
 P/Sales                              0.023100
 Gross Profit Margin                  0.019524
 Log Revenue                          0.017117
 Asset Turnover                       0.017060
 Net Profit Margin                    0.017060
 P/Cash                               0.016489
 Price to Book Value                  0.015124
 Earnings Yield                       0.015124
 FCF Yield                            0.015124
 Market-Cap                           0.015124
 P/E                                  0.015124
 P/FCF                                0.015124
 P/NCAV                               0.015124
 P/NetNet                             0.015124
 Quick Ratio                          0.010466
 Return on Assets                     0.009083
 Return on Equity                     0.009083
 Share Buyback / FCF                  0.009083
 Dividends / FCF                      0.009083
 Current Ratio                        0.009083
 (Dividends + Share Buyback) / FCF    0.009083
 dtype: float64 

As you can see from the above list, the Dividend Yield has missing data for nearly half of its rows, while the Debt Ratio has missing data for a third of its rows. Let us remove all signals that have more than 25% missing data:

 # List of the columns before removing any.
 columns_before = df_signals.columns
 # Threshold for the number of rows that must be NaN for each column.
 thresh = 0.75 * len(df_signals.dropna(how='all'))
 # Remove all columns which don't have sufficient data.
 df_signals = df_signals.dropna(axis='columns', thresh=thresh)
 # List of the columns after the removal.
 columns_after = df_signals.columns
 # Show the columns that were removed.
 We get the output as,
 Index(['Debt Ratio', 'Dividend Yield', 'Inventory Turnover',
        'Net Acquisitions / Total Assets', 'R&D / Gross Profit',
        'R&D / Revenue', 'Return on Research Capital'],
Calling the Screener for NetNet Stocks

We want to test whether the NetNet investment strategy might work, so we will only use the signal-rows with a P/NetNet ratio between zero and one, thus indicating the stock is trading at a discount to its NetNet liquidation estimate. We create this stock-screener by making a boolean mask as follows:

 mask_netnet = (df_signals[P_NETNET] > 0) \
             & (df_signals[P_NETNET] < 1) 

Rows that satisfy the screener condition have a value of True, and rows that do not meet the condition have a value of False.



 Ticker  Date      
 A       2007-01-03    False
         2007-01-04    False
         2007-01-05    False
         2007-01-08    False
         2007-01-09    False
 Name: P/NetNet, dtype: bool 

Let us plot how many different stocks have traded at NetNet-discounts in the past,


We can also show all the tickers that have traded at NetNet discounts at some point:

tickers_netnet = mask_netnet[mask_netnet].reset_index()[TICKER].unique()

print('Number of unique tickers:', len(tickers_netnet))


Output :

 Number of unique tickers: 139

Let us now compare the P/NetNet ratio to 1-day stock returns, 

 # Name of the new column for the returns.
 TOTAL_RETURN_1D = 'Total Return 1-Day'
 # Calculate 1-day stock-returns.
 df_returns_1d = hub.returns(name=TOTAL_RETURN_1D,
                             bdays=1, future=True, annualized=False)
 # Show the 1-day stock-returns.
 # Limit the daily returns between -10% and +10%
 df_returns_1d = df_returns_1d.clip(-0.1, 0.1)
 # Combine the signals and stock-returns.
 # We are only using the rows which are NetNet discounts.
 dfs = [df_signals.loc[mask_netnet],
 df_sig_rets = pd.concat(dfs, axis=1)
 We can then create a scatter-plot of the P/NetNet ratio versus the daily stock-returns:
 df_sig_rets.plot(kind='scatter', grid=True,
                  x=P_NETNET, y=TOTAL_RETURN_1D); 

Using colours to distinguish the dots for different tickers:

 sns.scatterplot(x=P_NETNET, y=TOTAL_RETURN_1D, hue=TICKER,
                 data=df_sig_rets.reset_index(), legend=False); 

We can do the same for 1-3 year returns as well.

 # Name of the new column for the returns.
 TOTAL_RETURN_1_3Y = 'Total Return 1-3 Years'
 # Calculate the mean log-returns for all 1-3 year periods.
 df_returns_1_3y = \
                          future=True, annualized=True,
                          min_years=1, max_years=3)
 # Combine the signals and stock-returns.
 # We are only using the rows which are NetNet discounts.
 dfs = [df_signals.loc[mask_netnet],
 df_sig_rets = pd.concat(dfs, axis=1)
 Plotting the graph, 
 sns.scatterplot(x=P_NETNET, y=TOTAL_RETURN_1_3Y, hue=TICKER,
                 data=df_sig_rets.reset_index(), legend=False); 
Removing Outliers from Data

A common method for removing outliers is the so-called Winsorization of the data. It asically just limits or clips the data between, e.g. the 5% and 95% quantiles of the data. 

 # Select all columns except for the P/NetNet ratio.
 columns = df_sig_rets.columns.drop(P_NETNET)
 # Winsorize all the other signals and stock-returns.
 df_sig_rets2 = sf.winsorize(df_sig_rets, columns=columns)
 # Winsorize all the other signals and stock-returns.
 # Instead of clipping values beyond the bounds, set them to NaN.
 df_sig_rets = sf.winsorize(df_sig_rets, columns=columns, clip=False) 
Generating a Correlation 

Linear correlation (aka. Pearson correlation) is a simple statistical measure of how two variables change together. We will study the linear correlation between the signals and stock-returns to roughly assess which signals might be the best predictors for stock-returns. We will also study the linear correlation between the signals themselves to assess whether some of the signals seem redundant and can be removed.

 # Calculate the correlation between all signals and stock-returns.
 df_corr = df_sig_rets.corr()
 # Show how the signals are correlated with the stock-returns.
 # Sorted to show the strongest absolute correlations first.
 df_corr_returns = df_corr[TOTAL_RETURN_1_3Y].abs().sort_values(ascending=False)

Output :

 Total Return 1-3 Years               1.000000
 Current Ratio                        0.309601
 Quick Ratio                          0.293966
 Log Revenue                          0.220491
 Market-Cap                           0.217647
 P/FCF                                0.198681
 Interest Coverage                    0.196469
 Share Buyback / FCF                  0.194303
 Earnings Yield                       0.177254
 (Dividends + Share Buyback) / FCF    0.169556
 FCF Yield                            0.166700
 Return on Equity                     0.154312
 Earnings Growth YOY                  0.148643
 Assets Growth QOQ                    0.136451
 Price to Book Value                  0.091554
 FCF Growth YOY                       0.090232
 Assets Growth                        0.089341
 FCF Growth                           0.084630
 Asset Turnover                       0.084033
 Dividends / FCF                      0.068013
 Gross Profit Margin                  0.062543
 Assets Growth YOY                    0.062482
 Return on Assets                     0.057145
 Sales Growth                         0.054878
 Earnings Growth QOQ                  0.042634
 Sales Growth YOY                     0.041231
 Net Profit Margin                    0.040865
 FCF Growth QOQ                       0.037546
 CapEx / (Depr + Amor)                0.020960
 P/NCAV                               0.020314
 Earnings Growth                      0.015977
 P/NetNet                             0.014188
 Sales Growth QOQ                     0.008207
 P/Cash                               0.006348
 P/Sales                              0.005994
 P/E                                  0.004789
 Name: Total Return 1-3 Years, dtype: float64 

We can also create a correlation-matrix between all the signals and the 1-3 year average stock-returns.

df_corr.round(2).style.background_gradient(cmap='seismic', axis=None)

Even for a small number of signals, the correlation matrix can easily get confusing. Instead, we can show only the larger correlations:

 # Only show the large (absolute) correlation values.
 df_corr2 = df_corr[df_corr.abs() > 0.7]
 # Round correlation values to 2 digits.
 df_corr2 = df_corr2.round(2)
 # Transform the table to give a better overview.
 df_corr2 = df_corr2.stack()
 # Remove all values that are 1.0
 df_corr2 = df_corr2[df_corr2 != 1.0]
 # Show the result. Use a DataFrame for pretty printing.
 pd.DataFrame(df_corr2, columns=['Correlation']) 
Plotting the Final Output :

Let us make some scatter plots of the most important signals versus the stock-returns.

 # Plot these signals on the x-axis.
 # Plot the stock-returns on the y-axis
 y_vars = [TOTAL_RETURN_1_3Y]
 # Create the plots.
 g = sns.PairGrid(df_sig_rets.reset_index(), height=4,
                  x_vars=x_vars, y_vars=y_vars, hue=TICKER)

The colours in these plots represent different stock-tickers. The Gross Profit Margin was found further above to be one of the most important signals for predicting the future stock returns for 1-3 year period. The second scatter plot is for the P/NetNet signal. This had a lower R-squared value than the Gross Profit Margin. 


This article tried to implement a basic financial analysis using SimFin and explored the library and its uses. You can perform more such operations on it and understand the use of SimFin even further. You can check my implementation on the colab notebook here. 

Happy Learning!


Download our Mobile App

Subscribe to our newsletter

Join our editors every weekday evening as they steer you through the most significant news of the day.
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.

Our Recent Stories

Our Upcoming Events

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox